High-Performance Streaming Dictionary

ABSTRACT

A method, apparatus and computer program product for storing data in a disk storage system is presented. A high-performance dictionary data structure is defined. The dictionary data structure is stored on a disk storage system. Key-value pairs can be inserted into and deleted from the dictionary data structure. Updates run faster than one insertion per disk-head movement. The structure can also be stored on any system with two or more levels of memory. The dictionary is high performance and supports full transactional semantics, concurrent access from multiple transactions, and logging and recovery. Keys can be looked up with only a logarithmic number of transfers, even for keys that have been recently inserted or deleted. Queries can be performed on ranges of key-value pairs, including recently inserted or deleted pairs, at a constant fraction of the bandwidth of the disk.

1 BACKGROUND

This invention relates to the storage of information on computer-readable media such as disk drives, solid-state disk drives, and other data storage systems.

An example of a system storing information comprises a computer attached to a hard disk drive. The computer stores data on the hard disk drive. The data is organized as tables, each table comprising a sequence of records. For example, a payroll system might have a table of employees. Each record corresponds to a single employee and includes, for example,

1. First name (a character string),

2. Last name (a character string),

3. Social Security Number (a nine-digit integer),

4. A birth date (a date), and

5. An annual salary, in cents (a number).

The system might maintain another table listing all of the payments that have been made to each employee. This table might include, for example,

1. Social Security Number,

2. Payroll date (a date), and

3. Gross pay (a number).

The employee table might be maintained in sorted order according to social security number. By keeping the data sorted, the system may be able to find an employee quickly. For example, if the data were not sorted, then the system might have to search through every record to find an employee. If the data is kept sorted, on the other hand, then the system could find an employee by using a divide-and-conquer approach, in the same way that one can look up a phone number in a hardcopy phone book by dividing the book in two, determining whether your party is in the first half or the second half, and then repeating this divide-and-conquer approach on the selected half.

The problem of efficiently maintaining sorted data can become more difficult when disk drives or other real data storage systems are used. Storage systems often have interesting performance properties. For example, having read a record from disk, it is typically much cheaper to read the next record than it is to read a record at the other end of the table. Many storage systems exhibit “locality”, in which accessing a set of data that is stored near each other is cheaper than accessing data that is distributed far and wide.

This invention can be used to maintain data, including but not limited to these sorted tables, as well as for other uses where data needs to be organized in a computer system.

2 SUMMARY

This invention can be used to implement dictionaries. Many databases or file systems employ a dictionary mapping keys to values. A dictionary is a collection of keys, and sometimes includes values.

In some systems, when data is stored in a disk storage system, the data is stored in a dictionary data structure stored on the disk storage system, and data is fetched from the disk storage system by accessing the dictionary.

In some systems, there is a computer-readable medium having computer-readable code thereon, where the code encodes instructions for storing data in a disk storage system. The computer-readable medium includes instructions for defining a dictionary data structure stored on the disk storage system.

In some systems, a computerized device is configured to process operations disclosed herein. In such a system the computerized device comprises a processor, a main memory, and a disk. The memory system is encoded with a process that provides a high-performance streaming dictionary that, when performed (e.g., when executed) on the processor, operates within the computerized device to perform operations explained herein.

Other systems that are disclosed herein include software programs to perform the operations summarized above and disclosed in detail below. More particularly, a computer program product can implement such a system. The computer program logic, when executed on at least one processor in a computing system, causes the processor to perform the operations indicated herein. Such arrangements of logic can be provided as software, code and/or other data structures arranged or encoded on a computer readable medium, or combinations thereof, including but not limited to an optical medium (for example, CD-ROM), floppy or hard disk (for example, rotating magnetic media, solid state drive, etc.) or other media including but not limited to firmware or microcode in one or more ROM or RAM or PROM chips or as an Application Specific Integrated Circuit (ASIC), networked memory servers, or as downloadable software images in one or more modules, shared libraries, etc. The software or firmware or other such configurations can be installed onto a computerized device to cause one or more processors in the computerized device to perform the techniques explained herein. Software processes that operate in a collection of computerized devices, including but not limited to a group of data communications devices or other entities, can also provide the system described here. The system can be distributed between many software processes on several data communications devices, or all processes could run on a small set of dedicated computers, or on one computer alone.

The system can be implemented as a data storage system, or as a software program, or as circuitry, or as some combination, including but not limited to a data storage device. The system may be employed in data storage devices and/or software systems for such devices.

The memory system of a computer typically comprises one or more storage devices. Often the devices are organized into levels in a storage hierarchy. Examples of devices include registers, first-level cache, second-level cache, main memory, a hard drive, the cache inside a hard drive, tape drives, and network-attached storage. As technology develops, other devices may be developed. Additional examples of storage devices will be apparent to one of ordinary skill in the art. In this patent, we often describe the system as though it consists of only two levels in a hierarchy, and discuss how to optimize the number of transfers between one level and another. But the same analysis applies whether considering transfers from cache to main memory, or transfers from main memory to disk, or transfers between main memory and a hard drive, or transfers between any two storage devices, even if they are not organized into levels in a hierarchy. And a memory hierarchy can comprise many different levels. For convenience of description we will often refer to one device as RAM, in-RAM, in-memory, internal memory, main memory, or fast memory, whereas we will refer to a second level as disk, out of memory, on disk, or slow memory. It will be apparent to one of ordinary skill in the art that a dictionary can be implemented to use combinations of storage devices, such pairs including cache versus main memory, different parts of cache, main memory versus disk cache, disk cache versus disk, disk versus network-attached storage, registers versus cache, etc. Furthermore, a dictionary can be implemented using more than two storage devices, for example using all of the storage devices mentioned above. Instead of analyzing the number of transfers between two devices which are adjacent in a storage hierarchy, one could analyze the transfers between non-adjacent levels of memory, or between any two devices of a memory system. Furthermore, there could be multiple instances of each level; that is, there might be multiple caches, for example one or more for each processor, or there might be multiple disks.

3 BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.

FIG. 1 depicts a configuration of processing and storage/memory elements in a computer system.

FIG. 2 depicts a computer transforming data and queries into results.

FIG. 3 depicts data stored in a tree-structured dictionary.

FIG. 4 depicts the structure of a leaf node in RAM.

FIG. 5 depicts the structure of a packed-memory array supporting variable sized elements (PMAVSE).

FIG. 6 depicts the structure of smeared and squished representations of a PMAVSE.

FIG. 7 depicts an example of two levels in the tree containing messages in buffers before messages have moved down the tree.

FIG. 8 depicts an example of two levels in the tree containing messages in buffers after messages have moved down the tree.

FIG. 9 depicts the structure of a path-based implementation of cursors.

FIG. 10 depicts the structure of a memory pool.

FIG. 11 depicts the structure of an order-maintenance tree (OMT).

FIG. 12 depicts a system using two layers of dictionaries.

FIG. 13 depicts a system in which messages are placed according to what is in RAM in a non-tree-based dictionary.

FIG. 14 depicts a key-value implementation of cursors.

FIG. 15 depicts a system providing additional acknowledgments and feedback upon updates.

FIG. 16 depicts a system in which there are concurrent insertions which are uniformly randomly distributed in the keyspace.

FIG. 17 depicts a system in which concurrent insertions exhibit a skewed distribution.

FIG. 18 depicts the structure of an order-maintenance tree represented by a tree.

FIG. 19 depicts how the system looks up a value in an OMT.

FIG. 20 depicts how the system looks up an index in an OMT.

FIG. 21 depicts how the system inserts into an OMT in an unbalanced fashion.

FIG. 22 depicts various types of leaf entries.

FIG. 23 depicts how the system computes a checksum.

FIG. 24 depicts the inner loop of the checksum calculation using 64-bit operations.

FIG. 25 depicts how the system computes a checksum in parallel, expressed in Cilk++.

FIG. 26 depicts the structure of a nonleaf node in RAM.

FIG. 27 depicts a buffer structure.

FIG. 28 depicts various messages in a message format.

FIG. 29 depicts a message encoding example.

FIG. 30 depicts a block translation table.

FIG. 31 depicts a block translation pair.

FIG. 32 depicts a segment allocator.

FIG. 33 depicts a statistics structure for nonleaf nodes.

FIG. 34 depicts an example of insertion with nested transactions.

FIG. 35 depicts the messages sent in an example of insertion with nested transactions.

FIG. 36 depicts the processing of messages for an example of insertion with nested transactions.

FIG. 37 depicts a leaf entry employing a placeholder.

FIG. 38 depicts an example with insertions and queries in nested transactions.

FIG. 39 depicts the processing of messages and queries.

FIG. 40 depicts an example with insertions and deletions.

FIG. 41 depicts the processing of messages for insertions and deletions.

FIG. 42 depicts an example with insertions and aborts.

FIG. 43 depicts the processing of insertions and aborts.

FIG. 44 depicts an example of a packed memory array.

FIG. 45 depicts an example of the PMA from FIG. 44 after an additional key has been added and a subarray has been rebalanced.

FIG. 46 depicts a buffer pool.

FIG. 47 depicts the format of some log entry types.

FIG. 48 depicts the format of some additional log entry types.

FIG. 49 depicts the format of more additional log entry types.

FIG. 50 depicts the format of even more additional log entry types.

FIG. 51 depicts the format of yet more additional log entry types.

FIG. 52 depicts recovery states.

FIG. 53 depicts two disks implementing compact partitioning.

FIG. 54 depicts disk recycling.

FIG. 55 depicts disk recycling in an APMA mode.

FIG. 56 depicts a loader pipeline.

4 DETAILED DESCRIPTION

The invention may be practiced in a computer system that operates on data stored in a computer. Typically a computer system, as illustrated in FIG. 1, comprises one or more processor chips (101), each of which may have one or more CPU cores. The processors are connected to a cache (102), a main memory (103) sometimes called RAM, and to a secondary memory, which is often implemented as one or more hard disk drives (104). The disk drives may be implemented as rotating disks, and may have a processor and a cache in the disk drive. The cache in the disk drive may include nonvolatile storage (e.g., with battery backup) so that if the power fails, recent writes will not be lost. The disk drives may also be implemented as solid state storage. The system may also have other storage including but not limited to magnetic tape or optical laser disks. The disk drives may be connected to the processors directly, or through a network. Depending on the configuration, different protocols are used to communicate between the processors and the disks. Some systems use SCSI or SATA, and others use higher-level protocols including but not limited to NFS (the Sun Network File System), and some systems use still other protocols. Several distinct computers may work together over a network. Collectively, the disks, RAM, cache, and other data storage are referred to as a storage hierarchy. A program is provided to the processors, to direct the processors to perform the operations described below. The program may be stored at one or more levels of the storage hierarchy. For example, copies of the program may be stored on disk, on a DVD, and in RAM.

A system is said to cache a value if it stores the value in a faster part of the memory hierarchy rather than in a slower part of the memory hierarchy (or rather than recomputing the value). For example, the system may cache blocks in the cache (102) from RAM (103). It may cache values in RAM (103) that might otherwise require accessing disk (104). Or if a value is expensive to compute, it may cache a copy of that value, to avoid recomputing the value in the future.

In a typical mode of operation, the system operates as shown in FIG. 2. A computer (201) receives data from a data source (202), and stores that data in its storage hierarchy. One or more clients (203) form queries (204) (which may include commands to insert, modify, or delete data), which are input into the computer (201). The computer (201) operates on the data stored in its storage hierarchy. Such operation may include sorting, indexing, scanning, and otherwise reading and writing the data in the hierarchy. The computer produces a result (205) which is returned to the client.

In one mode of operation, the system organizes data in a tree structure. A tree comprises nodes. A node comprises a sequence of children. If a node has no children, then it is called a leaf node. If the node has children, it is called a nonleaf node or internal node. There is one root node that has no parent. All other nodes have exactly one parent. The tree nodes can have different on-disk and in-RAM representations. When a tree node is read from disk and brought into internal RAM, the node is first converted from the on-disk data format to the in-RAM data format, if different. When a tree node is evicted from RAM and written back to disk, it is converted from the in-RAM data format back to the on-disk data format, if different.

FIG. 3 shows an example tree containing employee records. The tree comprises a root node (301), which comprises two pointers (302), one to a left child (303) and one to a right child (304), and a pivot key (305). The pivot key of the root node (301) is the social security number 222-33-3333, indicating that all the social security numbers in the subtree rooted at the left child (303) are at or before 222-33-3333, and all the social security numbers in the subtree rooted at the right child (304) are at or after 222-33-3333. Similarly the left child (303) comprises two pointers (one to leaf node LeafA (306), and one to leaf node LeafB (307)) and a pivot key (305), and the right child (304) comprises two pointers and a pivot key.

Each leaf node (306, 307, 308, and 309 respectively) includes three employee records, each with a social security number, a name, and a salary. The leaf nodes collectively contain the employee records (310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, and 321).

To find a given employee, such as the one with social security number 333-22-2222, the system examines the root node (301) and determines that the pivot key (305) stored therein is less than the employee's, so the system examines the right child (304), where it discovers that the pivot key stored therein is greater than the employee's, so the system examines the left leaf node (308) of the right child (304), where it can find the complete record of the employee.

When the system needs to examine a node, including the root node (301), the system may be storing the node on disk (104) or in memory (103). If the node is not in memory (103), then the system moves a copy of the node from disk (104) to memory (103), changing the node's representation if appropriate. If memory (103) becomes full, the system may write nodes back from memory (103) to disk (104). If the in-memory node has not been changed since it was read, the system may simply delete the node from memory (103) if there is still a copy on disk (104).

There are variations on this memory hierarchy. For example, the data can be moved between any two or more different types of computer-readable media. Another example is that the system may sometimes store data to disk in the in-RAM format, and sometimes store data in the on-disk format.

Alternatively, organizing data with different representations can be employed for other structures besides trees. For example, if the dictionary is a cache-oblivious look-ahead array or a cascading array, then different in-RAM and on-disk representations could be employed for different subarrays.

In the example of FIG. 3 every nonleaf node has two children. The system organizes trees with a varying number of children, with some nodes having fewer and some having more than two children. For example, nodes might have hundreds or thousands or even more children.

In a tree, the height of a leaf node is 0. The height of a nonleaf node is one plus the maximum height of its children nodes. The depth of the root node is 0. The depth of a nonroot node is one plus the depth of its parent.

The system employs trees of uniform depth. That is, all of the leaves are at the same depth. Alternatively, the tree could be of nonuniform depth, or the system could employ another structure. For example, a system could employ a structure in which some nodes have two or more parents, or in which a tree has multiple roots, or a structure which contains cycles.

The subtree of a tree rooted at node n comprises n and the subtrees rooted at the children of n. This implies that the subtree rooted at a leaf node is the node itself. A tree or subtree can contain only one node, or it can contain many nodes.

Whenever a new key-value pair (k, v) is inserted into the dictionary, it logically replaces any previous pair (k, v′) that exists.

For dictionaries that allow duplicate keys, other rules apply. For example, all the different key-value pairs may be kept, even though some have the same key. In such a dictionary, pairs might be stored logically in sorted order. That is, record (k, v) is logically before record (k′, v′) if and only if k<k′ or (k=k′ and v<v′), where the comparisons are made with the appropriate comparison functions.

The system supports dictionaries with no duplicate keys (NODUP) as well as dictionaries with duplicate keys (DUP), which break ties by comparing values. In a DUP dictionary, inserting a duplicate key with a duplicate value typically has no effect.

A key is represented as a byte string. The user of the dictionary may supply comparison functions for keys. If no comparison function is supplied by the user, then a default comparison function is used, in which keys are ordered lexicographically as byte strings.

Similarly, values are represented as byte strings, and the user may supply a comparison function for values.

A tree-structured data structure is organized as a search tree when nonleaf nodes of the tree comprise pivot keys (which may be keys or key-value pairs, or they may be substrings of keys or key-value pairs). If a node has n children, then for 0≤i<n−1, the subtree rooted at child i contains pairs that are less than or equal to pivot key i, and for 1≤i<n, the subtree rooted at child i contains pairs that are greater than pivot key i−1. We say that a pair p belongs to child i if

1. i=0 and p is less than or equal to pivot key 0, or

2. i=n−1 and p is greater than pivot key n−2, or

3. 0<i<n−1 and p is less than or equal to pivot key i and greater than pivot key i−1. (A code sketch following this list illustrates these rules.)
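
The following is a minimal C sketch, not taken from the patent, of selecting the child to which a pair belongs under these rules; the function and parameter names are hypothetical, and a comparison function on pairs and pivot keys is assumed:

```c
/* A sketch (names hypothetical) of selecting the child to which a
 * pair p belongs, following the three rules above.  cmp() compares a
 * pair with a pivot key; pivots[] holds the n-1 pivot keys of a node
 * with n children. */
int child_for_pair(const void *p, const void *const pivots[], int n,
                   int (*cmp)(const void *pair, const void *pivot)) {
    for (int i = 0; i < n - 1; i++) {
        if (cmp(p, pivots[i]) <= 0)
            return i;        /* p is less than or equal to pivot key i */
    }
    return n - 1;            /* p is greater than pivot key n-2 */
}
```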

The system includes a front-end module that receives commands from a user and converts them to operations on a dictionary. For example, the front end of a SQL database receives SQL commands which are then executed as a sequence of dictionary operations.

Dictionaries

The system implements a dictionary in which keys can be compared. That is, given two keys, they are either considered to be the same, or one is considered to be ordered ahead of the other. For example, if a dictionary uses integers as keys, then the number 1 is ordered ahead of the number 2. In some dictionaries the keys are not ordered.

Another example is that a character string can be used as a key. A character string is a sequence of characters. For example, the string “abc” denotes the string s where the first character is ‘a’, the second character is ‘b’, and the third character is ‘c’. We denote the first character as s₀, the second character as s₁, and so forth. Thus, in this example, indexing of strings starts at 0. Strings can be ordered using a lexicographic ordering. Typically, in a lexicographically ordered system, two strings s and r are considered to be the same if they are the same length (that is, |s|=|r|) and the ith character is the same for all relevant values of i (that is, s_(i)=r_(i) for all 0≤i<|s|). If there is some index i such that s_(i)≠r_(i), then let j be the minimum such index. String s is considered to be ahead of string r if and only if s_(j) is before r_(j). If there is no such i, then the remaining case is that one of the strings is a prefix of the other, and the shorter string is considered to be ahead of the longer one.
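
A lexicographic byte-string comparison of this kind might be written as follows (a sketch, not the system's actual code; the function name is hypothetical):

```c
#include <string.h>

/* Lexicographic comparison of two byte strings, as described above.
 * Returns a negative value, zero, or a positive value according to
 * whether a is ahead of, the same as, or after b. */
int lex_compare(const unsigned char *a, size_t alen,
                const unsigned char *b, size_t blen) {
    size_t minlen = alen < blen ? alen : blen;
    int c = memcmp(a, b, minlen);  /* first differing byte decides */
    if (c != 0)
        return c;
    /* One string is a prefix of the other: the shorter one is ahead. */
    return (alen < blen) ? -1 : (alen > blen) ? 1 : 0;
}
```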

Another example is when one has a collection of vectors, all the same length, where corresponding vector elements can be compared. One way to compare vectors is that two vectors are considered the same if their respective components are the same. Otherwise, the first differing component determines which vector is ahead of the other. In some systems, the vectors may be of different lengths, or corresponding elements may not be comparable.

Alternatively, there exist many other ways of constructing keys. Examples include comparing the last element of a sequence first, or ordering the keys in the reverse of their natural order (including but not limited to ordering the integers so that they descend rather than ascend).

A dictionary can be conceived as containing key-value pairs. A key-value pair comprises a key and a value.

One way to use dictionaries is for all the keys to be unique. Another way to use dictionaries allows keys to be duplicated for different entries. For example, some dictionaries might allow duplicate keys, in which case ties are broken using some other technique, including but not limited to based on which pair was inserted earlier or based on the contents of the values.

The same data can be stored in many different dictionaries. For some of these dictionaries, the role of the values comprising the key and value may be changed. For example, a key in one dictionary may be used as the value in another dictionary. Or the key may comprise the key of another dictionary concatenated or otherwise combined with parts of a value. Each dictionary may have an associated total ordering on the keys. Different dictionaries may contain the same key-value pairs, but with a different ordering function. For example, a system might employ two dictionaries, one of which is the reverse of the other. An example would be a dictionary containing names of people as keys. A system might maintain one dictionary in which the names are sorted by last name, and another in which the names are sorted by first name.

Given a key, a search operation can determine if a key is stored in a dictionary, and return the key's associated value if there is one. Given a key, finding the corresponding key-value pair if it exists, or reporting the nonexistence if it does not exist, is called looking up the key. It is also referred to as a search or a get. In some situations, a look up, search, or get may perform different operations (including but not limited to not returning the associated value, or performing additional operations). Given a key k, the system can find the successor of k in the dictionary (if there is one), which is the smallest key greater than k in the dictionary. The system can also find the predecessor (if there is one). Another common dictionary operation is to perform a range scan on the dictionary: given two keys k and k′, find all the key-value pairs (k″, v) such that k≤k″≤k′. One way to perform a range scan is to first find the successor k″ of k, and then find the successor of k″, and then find the successor of that key, and so forth, until a key larger than k′ is found. Another way to perform a range scan is to find the predecessor of k′, and then use subsequent predecessor operations to find the pairs in reverse order. Alternatively, there are other implementations of range scans, including but not limited to using a cursor.

Typically a system performs a range scan in order to perform an operation on each pair as it is found. An example operation is to sum up all the values when the values are numbers. Other examples are to make a list of all the pairs or keys or values; or to make a list of the first element of every value (for example, if the values are sequences); or to count the number of pairs. Many other operations can be performed on the pairs of a range query. Some range scan operations can be more efficient if the values are produced in a particular order (for example, smallest to largest, or largest to smallest). For example, joining two dictionaries in a relational database can be more efficient if the dictionaries are traversed in a particular order. Other range scan operations may be equally efficient in any order. For example, to count the number of pairs in a range, the values can be found in any order.

There are several ways that dictionaries can deal with the possibility of duplicate keys, that is, key-value pairs with the same key.

For example, some dictionaries forbid duplicate keys. One way to forbid duplicate keys is to ensure that whenever a key-value pair (k, v) is inserted into the dictionary, it overwrites any previous value v′ associated with key k. Alternatively, there are other ways to prevent duplicate keys. For example, the dictionary could be left unchanged when a duplicate is inserted. Another example is to generate an error when a duplicate is inserted.

Another way to handle duplicate keys is to extend the comparison on keys to allow comparisons on key-value pairs. In this approach, duplicate keys are allowed as long as any two records with the same key have different values, in which case a value comparison function is provided by the system to induce a total order on the values. Key-value pairs are stored in sorted order based on the key comparison function, and for duplicate keys, based on the value comparison function. This kind of duplication can be employed, for example, to build an index in a relational database.

Alternatively, there are other ways to accommodate duplicate keys in a dictionary. For example, a system might “break ties” by considering pairs that were inserted earlier to be ordered earlier than pairs that were inserted later. Such a system could even accommodate “duplicate duplicates”, in which both the key and the value are equal. Alternatively, when storing pairs with duplicate keys, the key might be stored only once, which could save space; and for duplicate duplicates, the value could be stored only once, which could save space.

Alternatively, other space-saving techniques can be employed. For example, when keys and values are strings, often two adjacent keys share a common prefix. In this case, the system could save space by storing the common prefix only once.

The system employs a tree structure to implement dictionaries. As the system traverses the tree from left to right, it encounters key-value pairs in sorted order.

Leaf Node in Memory

FIG. 4 shows a leaf node (401) in RAM. The node is a structure comprising a leaf data block (402), an order maintenance tree (OMT) (1101), and a memory pool (1001). The leaf data block (402) is a structure comprising

1. isdup (403), a Boolean that indicates whether the leaf node is part of a DUP or a NODUP tree;

2. blocknum (404), a 64-bit number that indicates which block number is used to store the data on disk;

3. height (405), a 16-bit number that indicates the height of the node (which is 0 for leaf nodes);

4. randfingerprint (406), a 32-bit number employed for calculating fingerprints;

5. localfingerprint (407), a 32-bit number that provides a fingerprint of the values stored in the leaf;

6. fingerprint (408), a 32-bit number that contains the fingerprint of the node;

7. dirty (409), a Boolean that indicates that the in-RAM node represents different data than does the on-disk node (that is, that the node has been modified in RAM);

8. fullhash (410), a 32-bit number that is a hash value used to find the leaf node in a Buffer Pool (4601);

9. nodelsn (411), a 64-bit number that equals the log sequence number (LSN) associated with the most recent change to the node;

10. nbytesinbuffer (412), a 32-bit number that indicates how many bytes of leaf entries are stored in the leaf node, including overheads such as lengths;

11. seqinsert (413), a 32-bit number that indicates how many insertions have been performed with sequentially increasing keys;

12. statistics (414), a structure that maintains statistical information;

13. omt_pointer (415), a pointer to an OMT (1101); and

14. mem_pointer (416), a pointer to a memory pool (1001).

The system calculates the fingerprint (408) of a leaf node by taking the sum, over the leaf entries in the node, of the fingerprints of the leaf entries. The fingerprint of a leaf entry in a node is taken by computing a checksum, for example as shown in FIG. 23, of the leaf entry, and multiplying that checksum by the randfingerprint (406) of the node.

The system establishes the fingerprint seed randfingerprint (406) when a node is created by choosing a random number (e.g., with the random() C library function, which in turn can be seeded, e.g., with the date and time).

The fullhash (410) is a hash of the blocknum (404) and a dictionary identifier. The system employs fullhash (410) to look up blocks in the buffer pool.

The system keeps track of how many insertions have been performed with sequentially increasing keys using the seqinsert (413) counter. The system increments the counter whenever a pair is inserted at the rightmost position of a node. Every time a pair is inserted elsewhere, the counter is decremented, with a lower limit of zero. When a leaf node splits, if the seqinsert (413) counter is larger than one fourth of the inserted keys, the system splits the node unevenly.

Alternatively, other methods for maintaining and using such a counter can be employed. For example, the system could split unevenly if the counter is greater than a constant such as four. For another example, the system could remember the identity of the most recently inserted pair, and increment the counter whenever a new insertion is adjacent to the previous insertion. In that case, when choosing a point at which to split a node, if the counter is large the system can split the node at the most recently inserted pair.

Alternatively, the particular sizes of the numbers can be chosen differently. For example, the nbytesinbuffer (412) field could be made larger so that more than 2³² bytes could be stored in a leaf block. Similar size changes could be made throughout the system. In the following description, we use the word “number” to indicate a number with an appropriate number of bits to represent the range of numbers required.

The system sets the dirty (409) Boolean to TRUE whenever the system modifies a node. When the system writes a node to disk, it sets the dirty (409) Boolean to FALSE.

To insert a key-value pair into a leaf node (401), the system first allocates space in the node's memory pool (1001) (which may invoke the memory pool's mechanism for creating a new internal buffer and copying all the values to that space), and copies the value into the newly allocated space. Then the memory-pool pointer to that value is stored in the OMT (1101).

Memory Pool

FIG. 10 shows a memory pool (1001). A memory pool (1001) is a structure comprising

1. mpbase (1005), a memory pointer to a memory block (1006);

2. mpsize (1003), a number which indicates the length of the memory block;

3. freeoffset (1002), a number which indicates the beginning of the unused part of the memory block. All bytes in the memory block beyond the freeoffset (1002) are not in use. In FIG. 10, the free space is shown inside the memory block as a crosshatched region marked “never yet used” (1007); and

4. fragsize (1004), a number indicating how many allocated bytes of the memory block are no longer in use. In FIG. 10 the hatched regions (1009) are no longer in use. The fragsize (1004) is the sum of the sizes of the blocks no longer in use. The blocks that are still in use (1008) are also shown.

To allocate n bytes of memory in a memory pool, the system increments freeoffset (1002) by n. If the freeoffset (1002) is not larger than mpsize (1003), then the memory has been allocated. Otherwise, a new block of memory is allocated (using, for example, the system's standard library malloc() function) of size 2·(freeoffset−fragsize), and all useful data is copied from the old memory block to the beginning of the new memory block. The useful data can be identified as the pointer values stored in the OMT (1101). The mpbase (1005) is set to point at the new memory block, and the old memory block is freed. The mpsize (1003) is set to the new size, the freeoffset (1002) is set to (freeoffset−fragsize), and the fragsize (1004) is set to 0.

To free a subblock of size n of memory in a memory pool, the system increments the fragsize (1004) by n.
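
The allocation and freeing procedures just described might be sketched in C as follows. This is a simplified sketch, assuming a hypothetical copy_live_data() helper that walks the OMT to copy the in-use values; field names follow FIG. 10:

```c
#include <stdlib.h>

struct mempool {
    char  *mpbase;      /* (1005) pointer to the memory block */
    size_t mpsize;      /* (1003) length of the memory block */
    size_t freeoffset;  /* (1002) start of the never-yet-used region */
    size_t fragsize;    /* (1004) allocated bytes no longer in use */
};

/* Hypothetical helper: walks the OMT, copying the in-use values into
 * the new block and updating the OMT's pointers. */
void copy_live_data(char *newbase, struct mempool *mp);

void *mempool_malloc(struct mempool *mp, size_t n) {
    if (mp->freeoffset + n > mp->mpsize) {
        /* Grow: the new block is twice the size of the live data plus
         * the current request, as described above. */
        size_t live = mp->freeoffset - mp->fragsize;
        char *newbase = malloc(2 * (live + n));
        copy_live_data(newbase, mp);
        free(mp->mpbase);
        mp->mpbase = newbase;
        mp->mpsize = 2 * (live + n);
        mp->freeoffset = live;
        mp->fragsize = 0;
    }
    void *p = mp->mpbase + mp->freeoffset;
    mp->freeoffset += n;
    return p;
}

/* Freeing a subblock of size n only records that n more bytes of the
 * block are no longer in use. */
void mempool_mfree(struct mempool *mp, size_t n) {
    mp->fragsize += n;
}
```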

Order Maintenance Tree

An order-maintenance tree (OMT) is an in-memory dictionary. An OMT has two representations: a sorted array, and a weight-balanced tree. An OMT can insert and look up a particular key-value pair by using the comparison function on pairs.

An OMT can also look up the ith key-value pair, knowing only i (similarly to an array access). For example, an OMT can look up the third value in the sorted sequence of all the values. Also, an OMT can insert a pair after the ith pair.

FIG. 11 shows an OMT. An OMT is a structure comprising

1. is_array (1102), a Boolean indicating whether the OMT uses the array or tree representation;

2. omt_cursors (1103), a linked list of OMT cursors;

3. omt_array (1104), a pointer pointing to the array (in the case that is_array (1102) is TRUE); and

4. omt_tree (1105), a pointer pointing to the tree (in the case that is_array (1102) is FALSE).

Since omt_array (1104) and omt_tree (1105) are never both valid at the same time, the same memory can be used to hold both pointers using, for example, a C-language union.

In the array representation, an OMT's omt_array (1104) pointer points at a sorted array of key-value pairs. To look up a key, perform a binary search on the array. To look up the ith value, index the array using i. To insert or delete a value, first convert the OMT into the tree representation, and then perform the insertion or deletion.

FIG. 18 shows the tree representation (1801) of an OMT. The tree comprises zero or more OMT nodes (1802, 1803, 1804, 1805, 1806). Each node is a structure comprising

1. omt_weight (1807), a number which is the size of the subtree rooted at this OMT node;

2. omt_left (1808), a pointer to the left subtree (the pointer is NULL if the left subtree is empty);

3. omt_right (1809), a pointer to the right subtree (the pointer is NULL if the right subtree is empty); and

4. omt_keyvalue (1810), a key-value pair.

By convention, if there is no left (or right) child of a node, we say that the left (respectively right) child is NULL.

The OMT tree is a search tree, meaning that all the pairs in the left subtree of a node are less than the pair of the node, and the pair of the node is less than all the pairs in the right subtree.

We define the left-weight of an OMT node to be the number of nodes in its left subtree. The left-weight of a node N can be calculated by examining the pointer in omt_left (1808). If that is NULL then the left-weight of N is zero. Otherwise the left-weight is the value stored in omt_weight (1807) of the OMT node pointed to by omt_left (1808).

We define the right-weight of an OMT node to be the number of nodes in its right subtree.

The OMT tree is weight balanced, meaning that one plus the left-weight is within a factor of two of one plus the right-weight for every node in the OMT tree. For example, Node (1802) has left-weight equal to 3 (three nodes in its left subtree) and right-weight equal to 1 (one node in its right subtree); the compared quantities are 4 (3 plus 1) and 2 (1 plus 1). Since 2 is within a factor of two of 4, Node (1802) is weight balanced.

Given a pair p, to find the index of that pair in an OMT rooted at Node N, the system performs a recursive tree search, starting at the root of the OMT, as shown in FIG. 19. In this function (1901),

1. if p is less than the omt_keyvalue (1810) of a node, then the index can be found by looking in the left child of the node, omt_left (1808), as shown at Line 2;

2. if p equals the omt_keyvalue (1810) of the node, then the index equals the left-weight of the node, because that is how many values in the tree are less than p, as shown at Line 4; and

3. otherwise, the index equals one plus the left-weight of the node plus the index of the pair in the right subtree, as shown at Line 5.
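
Rendered in C, the search might look like the following sketch (the type and function names are hypothetical; FIG. 19 is the authoritative version):

```c
struct pair;   /* an opaque key-value pair */

struct omt_node {
    unsigned omt_weight;          /* (1807) size of this subtree */
    struct omt_node *omt_left;    /* (1808) left subtree, or NULL */
    struct omt_node *omt_right;   /* (1809) right subtree, or NULL */
    struct pair *omt_keyvalue;    /* (1810) the node's pair */
};

/* The left-weight of n: the omt_weight of its left child, or 0. */
static unsigned left_weight(const struct omt_node *n) {
    return n->omt_left ? n->omt_left->omt_weight : 0;
}

/* Returns the index of pair p in the subtree rooted at n, assuming p
 * is present; cmp orders pairs. */
unsigned omt_find_pair(const struct omt_node *n, const struct pair *p,
                       int (*cmp)(const struct pair *, const struct pair *)) {
    int c = cmp(p, n->omt_keyvalue);
    if (c < 0)                    /* p is in the left subtree */
        return omt_find_pair(n->omt_left, p, cmp);
    if (c == 0)                   /* found: index = # of smaller pairs */
        return left_weight(n);
    return 1 + left_weight(n) + omt_find_pair(n->omt_right, p, cmp);
}
```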

To look up a value given an index i in an OMT rooted at Node N, the system traverses the tree recursively, as shown in FIG. 20. In this function (2001),

1. if i is less than the left-weight of N, then the value can be found by looking in the left child of the node, omt_left (1808), as shown at Line 2;

2. if i equals the left-weight of N, then the omt_keyvalue (1810) stored at N is returned, as shown at Line 4; and

3. otherwise, the value can be found by searching in the right subtree, omt_right (1809), with an index equal to i minus the left-weight of N minus 1, as shown at Line 5.
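
In the same hypothetical C rendering as above, the index lookup is:

```c
/* Returns the key-value pair at index i in the subtree rooted at n,
 * using the struct omt_node and left_weight() sketched earlier. */
struct pair *omt_find_value(const struct omt_node *n, unsigned i) {
    unsigned lw = left_weight(n);
    if (i < lw)                               /* Line 2 */
        return omt_find_value(n->omt_left, i);
    if (i == lw)                              /* Line 4 */
        return n->omt_keyvalue;
    return omt_find_value(n->omt_right, i - lw - 1);  /* Line 5 */
}
```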

To look up a value given a pair p, one can first find the index using the OMT_FIND_PAIR function shown in FIG. 19, and then use that index with the OMT_FIND_VALUE function shown in FIG. 20.

To insert a pair p into an OMT tree, the system first inserts the node in an unbalanced fashion, and then rebalances the tree if needed by rebuilding the largest unbalanced subtree.

FIG. 21 shows how to insert, in an unbalanced fashion, a value after a particular index, given a node N, an index i that will be the index of the new value, and the value v that is to be inserted. The function (2101) returns the new tree. In this function,

1. if N is NULL then a new OMT node is created, as shown at Line 2;

2. if i is less than or equal to the left-weight of N, then the value is inserted in the left child of N at index i, and the resulting tree is stored as the left child of N, as shown in Line 5, and N is returned as shown in Line 6; and

3. otherwise, the value is inserted in the right child of N, at the index i minus the left-weight of N minus 1, and the resulting tree is stored as the right child of N, as shown in Line 8, and N is returned as shown in Line 9.
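
A C sketch of this unbalanced insertion, continuing the hypothetical rendering above (the omt_weight update is implied by the definition of omt_weight rather than stated in the steps):

```c
#include <stdlib.h>

/* Inserts pair v so that it ends up at index i in the subtree rooted
 * at n, and returns the (possibly new) subtree root.  Rebalancing is
 * done separately, as described below. */
struct omt_node *omt_insert_unbalanced(struct omt_node *n, unsigned i,
                                       struct pair *v) {
    if (n == NULL) {                      /* Line 2: make a new node */
        struct omt_node *nn = calloc(1, sizeof *nn);
        nn->omt_weight = 1;
        nn->omt_keyvalue = v;
        return nn;
    }
    unsigned lw = left_weight(n);
    if (i <= lw)                          /* Lines 5-6 */
        n->omt_left = omt_insert_unbalanced(n->omt_left, i, v);
    else                                  /* Lines 8-9 */
        n->omt_right = omt_insert_unbalanced(n->omt_right, i - lw - 1, v);
    n->omt_weight++;                      /* the subtree grew by one */
    return n;
}
```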

After performing the unbalanced insertion, any unbalanced subtree is rebalanced. As the OMT_INSERT_UNBALANCED function returns, it records the highest unbalanced subtree, which is then rebalanced. If there is such a node, then the system takes the highest unbalanced node x (the node closest to the root), and rebuilds that subtree as a balanced tree. The system performs this rebuild by allocating an array large enough to hold all of the values in x, then scanning the subtree from left to right, filling in the array. If the entire tree is being rebalanced, then the array is used to build the OMT in the array representation. Otherwise, after filling in the array, the subtree is rebuilt by inserting the elements back into a tree to build a balanced tree. This balanced tree construction is accomplished by inserting the middle element of the array into the tree, and then recursively inserting the left subarray into the tree, and then the right subarray into the tree.

Alternatively, one can delete a pair from the OMT tree by first deleting in an unbalanced fashion and then rebuilding the largest unbalanced subtree.

An OMT cursor can be thought of as a pointer at a pair inside an OMT. Given a cursor, the system can fetch the pair, or it can move the cursor forward or backward. The system implements a cursor as an integer, which is the index in the array representation. Since that index can be used with both the tree and the array representations, it suffices. However, any time the tree changes shape (due to an insertion or deletion), that integer must be updated. When this occurs, the system invalidates the OMT cursor, and then the user of the OMT cursor reestablishes the cursor by looking up the relevant key-value pair. The OMT cursor provides a callback method to notify its user that the cursor is about to become invalid, so that the user can copy out the key-value pair, which will enable the user to reestablish the cursor. Alternatively, the system can update the integer as needed, or otherwise maintain the cursor in a valid state.

All of the OMT cursors that refer to a given OMT are maintained in a linked list stored at omt_cursors (1103).

Leaf Entries

The objects stored in an OMT can have extra information beyond the key-value pairs themselves. These objects, which comprise the key-value pairs and any additional information, are called leaf entries, and they are looked up in an OMT using the same key comparison used for key-value pairs. That is, for NODUP dictionaries, they are identified by a key, and for DUP dictionaries they are identified by a key-value pair. In this system, the extra information records whether the transaction that last inserted or deleted the key-value pair has committed or aborted.

FIG. 22 shows various types of leaf entries. Each leaf entry is serialized into a block of memory. This encoding is called the serialized representation of the leaf entry. The first byte is a tag le_tag (2201) that is used to discriminate between the different types of leaf entries. FIG. 22 shows four tags, which are LE_COMMITTED (encoded as 1), LE_BOTH (encoded as 2), LE_PROVDEL (encoded as 3), and LE_PROVVAL (encoded as 4).

1. A LE_COMMITTED leaf entry then encodes

(a) a key le_key (2202) that is encoded by encoding a numeric length and the key bytes, and

(b) a committed value le_cvalue (2204) that is encoded by encoding a numeric length and the value bytes.

2. A LE_BOTH leaf entry then encodes

(a) a key le_key (2202),

(b) a transaction identifier (XID) le_xid (2203) that encodes a 64-bit number,

(c) a committed value le_cvalue (2204), and

(d) a provisional value le_pvalue (2205) that is encoded by encoding a numeric length and the value bytes.

3. A LE_PROVDEL leaf entry then encodes

(a) a key le_key (2202),

(b) an XID le_xid (2203), and

(c) a committed value le_cvalue (2204).

4. A LE_PROVVAL leaf entry then encodes

(a) a key le_key (2202),

(b) an XID le_xid (2203), and

(c) a provisional value le_pvalue (2205).

These four leaf entry types further comprise a checksum le_checksum (2206).

Alternatively, other encodings can be used to implement a dictionary. For example, for a dictionary without transactions, it may suffice to employ only one type of leaf entry, comprising a key, a value, and a checksum.

Alternatively, the checksum can be modified to be more robust or less robust (or even removed). For example, if the reliability demanded by users of the system is much less than the reliability provided by the system, then the checksum might be removed to save cost.

Checksums

A checksum le_checksum (2206) can be calculated using any convenient checksum, such as a CRC. The system computes a checksum of a block of memory B of length l, calculated as shown in FIG. 23. The function CHECKSUM (2301) calculates a 32-bit unsigned checksum. In Line 1 a 64-bit unsigned variable s is set to zero. In Line 2 a number variable i is set to zero. Line 3 starts a loop that will run while i<l. Line 4 sets a 64-bit unsigned variable v to zero. Line 5 sets a number variable j to zero. Line 6 starts a loop that will run as long as i<l and j<8. Line 7 fetches the ith byte of Block B, left-shifts it by 7−j bytes (that is, multiplies it by 2^(8(7−j))), and then adds that value to v. Lines 8-9 increment i and j respectively. After the inner loop finishes, Line 10 updates s by multiplying s by 17 and adding v, and then truncating any higher-order bits so that the value fits into s. Line 11 converts the 64-bit value s to a 32-bit checksum by taking the high-order 32 bits of s and combining them with the low-order 32 bits of s using a bitwise exclusive-or operator, and then inverting the bits, and returning the result.
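
A C rendering of this line-by-line description might look as follows (a sketch; FIG. 23 is the authoritative version):

```c
#include <stdint.h>
#include <stddef.h>

/* Computes the 32-bit checksum of the l-byte block B, following the
 * description above. */
uint32_t checksum(const uint8_t *B, size_t l) {
    uint64_t s = 0;                            /* Line 1 */
    size_t i = 0;                              /* Line 2 */
    while (i < l) {                            /* Line 3 */
        uint64_t v = 0;                        /* Line 4 */
        for (size_t j = 0; i < l && j < 8; i++, j++)   /* Lines 5-9 */
            v += (uint64_t)B[i] << (8 * (7 - j));  /* shift by 7-j bytes */
        s = s * 17 + v;      /* Line 10: 64-bit overflow truncates */
    }
    return ~(uint32_t)((s >> 32) ^ s);  /* Line 11: fold halves, invert */
}
```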

For simplicity of further explanation, we focus on the inner loop of the checksum, which, after optimizing to operate directly on 64-bit values, can be expressed in the C99 programming language as shown in FIG. 24. If we include only 64-bit values, the checksum can be expressed mathematically as

$\sum\limits_{i}{a_{i} \cdot 17^{i}}$

where a_(i) is the ith 64-bit number. The function (2401) of FIG. 24 computes this checksum.
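
The 64-bit inner loop of FIG. 24 presumably has the following shape (a guess at the code, not a copy of the figure):

```c
#include <stdint.h>
#include <stddef.h>

/* Checksum of n 64-bit words: a Horner evaluation of the polynomial
 * sum of a_i * 17^i, where higher powers multiply earlier words. */
uint64_t checksum64(const uint64_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s = s * 17 + a[i];
    return s;
}
```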

To compute the same checksum in parallel, the system operates as follows. If a and b are vectors of 64-bit values, a+b is the concatenation of the vectors, and |b| is the length of b, then

checksum(a+b)=checksum(a)·17^(|b|)+checksum(b)

where all calculations are performed in 64-bit unsigned integer arithmetic.

The system computes 17^(x) by repeated squaring. For example

17¹⁰⁰=17⁶⁴·17³²·17⁴

so to compute it the system computes

x₂ = 17·17;

x₄ = x₂·x₂;

x₈ = x₄·x₄;

x₁₆ = x₈·x₈;

x₃₂ = x₁₆·x₁₆;

x₆₄ = x₃₂·x₃₂;

x₁₀₀ = x₆₄·x₃₂·x₄.

Thus the system computes 17^(x) modulo 2⁶⁴ in O(log x) 64-bit operations.
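
A sketch of this repeated-squaring computation in C (the function name is hypothetical):

```c
#include <stdint.h>

/* Computes 17^x modulo 2^64 in O(log x) multiplications.  On the k-th
 * iteration, base holds 17^(2^k); each set bit of x contributes its
 * corresponding power, e.g. pow17(100) multiplies 17^4, 17^32, 17^64. */
uint64_t pow17(uint64_t x) {
    uint64_t result = 1;
    uint64_t base = 17;
    while (x != 0) {
        if (x & 1)
            result *= base;   /* modulo 2^64 is implicit in uint64_t */
        base *= base;
        x >>= 1;
    }
    return result;
}
```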

Note that the “big-Oh” notation is used to indicate how fast a function grows, ignoring constant factors. Let f(n) and g(n) be non-decreasing functions defined over the positive integers. Then we say that f(n) is O(g(n)) if there exist positive constants c and n₀ such that for all n>n₀, f(n)<c·g(n).

FIG. 25 shows how to compute this checksum function in parallel using the Cilk++ programming language. In this language the cilk_for loop is like a C99 for loop, except that all the iterations can run in parallel. The x1764 function (2501) divides the work into 32 pieces, which can then be computed in parallel, and combined at the end.

Non-Leaf Nodes

FIG. 26 shows a nonleaf node (2601) in RAM. The node comprises a nonleaf data block (2602), a child information array (2603), and a pivot keys array (2604).

A nonleaf data block (2602) is a structure comprising

1. isdup (403), a Boolean;

2. blocknum (404), a number;

3. height (405), a number;

4. randfingerprint (406), a number;

5. localfingerprint (407), a number;

6. fingerprint (408), a number;

7. dirty (409), a Boolean;

8. fullhash (410), a number;

9. nodelsn (411), a number; and

10. statistics (414), a structure,

all of which serve essentially the same role as in the leaf node of FIG. 4. The nonleaf data block (2602) further comprises

1. nchildren (2606), a number indicating how many children the node has;

2. totalpivotkeylens (2607), a number indicating the sum of the lengths of the pivot keys;

3. nbytesinbufs (2608), a number indicating the sum of the numbers of bytes in the node's buffers;

4. pivotkeys (2610), a pointer to an array of pivot keys; and

5. childinfos (2609), a pointer to an array of structures containing information for each child.

The pointer childinfos (2609) refers to a child information array (2603) in RAM. The ith element, a child information structure (2605), of the array is a structure that contains information about the ith subtree of the node, comprising

1. subtreefingerprint (2611), a number which equals the fingerprint of the subtree;

2. childblocknum (2613), a number which equals the block number where the subtree's root is stored;

3. childfullhash (2614), a number which equals a hash of the subtree root, used to quickly find the node in a Buffer Pool (4601);

4. bufferptr (2615), a pointer to a buffer structure (2701); and

5. nbytesinbuf (2616), a number equal to the number of bytes in the buffer structure (2701).

If the node has n children then the child information array (2603) contains n structures, each a child information structure (2605). The value of nbytesinbufs (2608) is the sum of the various nbytesinbuf (2616) values in the child information array (2603). In FIG. 26 the child information array (2603) is shown with three elements, labeled 0, 1, and 2. Each element is a child information structure (2605).

The pointer pivotkeys (2610) refers to a pivot keys array (2604) of pivot keys. For a NODUP dictionary a pivot key comprises the key of a key-value pair. For a DUP dictionary a pivot key comprises both the key and the value of a key-value pair. If the node has n children then the pivot keys array (2604) contains n−1 pivot keys. In FIG. 26 the pivot keys array (2604) is shown with two slots which can each hold one pivot key. The array of pivot keys is maintained in sorted order. Let k₁, …, k_(a) denote the sequence of pivot keys in a nonleaf node; then k₁ ≤ … ≤ k_(a). For NODUP dictionaries, k₁ < … < k_(a).

FIG. 27 shows a buffer structure (2701), which implements a first-in-first-out (FIFO) of messages (2708). A FIFO is a structure comprising

1. n_in_fifo (2702), a number indicating how many messages (2708) are in the FIFO;

2. fifo_memory (2703), a pointer to a block of memory (2707) holding the messages (2708);

3. fifo_mem_size (2704), a number that says how big the block of memory is;

4. fifo_start (2705), a number that indicates the offset in the block of the oldest message; and

5. fifo_end (2706), a number that indicates the offset in the block just beyond the newest message (that is, the offset where the next message will be enqueued).

A buffer structure (2701) contains zero or more messages.

In FIG. 27 the FIFO is shown holding three entries of various sizes.

To enqueue a message of size M in a FIFO, the system uses the following procedure (a C sketch follows the list):

1. Let S be the size of the data in the FIFO (that is, the difference between fifo_end (2706) and fifo_start (2705)).

2. Let R be the size of the remaining space after the newest message (that is, the difference between fifo_mem_size (2704) and fifo_end (2706)).

3. If M>R (there is not enough space at the end of the memory block), then

(a) If either 2(S+M) is greater than fifo_mem_size (2704) or 4(S+M) is less than fifo_mem_size (2704), then allocate a new block of memory of size 2(S+M) and

i. copy the block of size S from offset fifo_start (2705) in the old block to the beginning of the new block,

ii. copy the message to offset S in the new block,

iii. set fifo_memory (2703) to point at the new block,

iv. set fifo_mem_size (2704) to 2(S+M),

v. set fifo_start (2705) to 0,

vi. set fifo_end (2706) to S+M, and

vii. free the old block.

(b) Otherwise (reuse the same block)

i. move the block of size S from offset fifo_start (2705) to the beginning of the block, copying from left to right to avoid overwriting one portion of the block in its old location before copying it to its new location,

ii. copy the message to offset S,

iii. set fifo_start (2705) to 0, and

iv. set fifo_end (2706) to S+M.

4. Otherwise there is space, therefore

(a) copy the message to offset fifo_end (2706), and

(b) increment fifo_end (2706) by M.
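
The enqueue procedure might be rendered in C as follows (a sketch; structure fields follow FIG. 27, and the update of n_in_fifo is implied rather than stated in the procedure):

```c
#include <stdlib.h>
#include <string.h>

struct fifo {
    int    n_in_fifo;       /* (2702) number of messages */
    char  *fifo_memory;     /* (2703) block of memory */
    size_t fifo_mem_size;   /* (2704) size of the block */
    size_t fifo_start;      /* (2705) offset of the oldest message */
    size_t fifo_end;        /* (2706) offset just past the newest */
};

void fifo_enq(struct fifo *f, const void *msg, size_t M) {
    size_t S = f->fifo_end - f->fifo_start;       /* step 1 */
    size_t R = f->fifo_mem_size - f->fifo_end;    /* step 2 */
    if (M > R) {                                  /* step 3 */
        if (2 * (S + M) > f->fifo_mem_size ||
            4 * (S + M) < f->fifo_mem_size) {     /* step 3(a): resize */
            char *nb = malloc(2 * (S + M));
            memcpy(nb, f->fifo_memory + f->fifo_start, S);
            memcpy(nb + S, msg, M);
            free(f->fifo_memory);
            f->fifo_memory = nb;
            f->fifo_mem_size = 2 * (S + M);
        } else {                                  /* step 3(b): reuse */
            /* memmove copies correctly despite the overlap. */
            memmove(f->fifo_memory, f->fifo_memory + f->fifo_start, S);
            memcpy(f->fifo_memory + S, msg, M);
        }
        f->fifo_start = 0;
        f->fifo_end = S + M;
    } else {                                      /* step 4 */
        memcpy(f->fifo_memory + f->fifo_end, msg, M);
        f->fifo_end += M;
    }
    f->n_in_fifo++;
}
```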

The fingerprint (408) of a nonleaf node is calculated by taking the sum, over all the messages in the node, of the fingerprints of the messages, further summing the fingerprints of the children nodes of the node. The system maintains a copy in each node of the fingerprint of each child in subtreefingerprint (2611). The fingerprint is calculated incrementally as the tree is updated. Alternatively, the fingerprint of a node can be updated when the node is written to disk (also updating the subtreefingerprint (2611) at that time).

The system maintains the fullhash (410) for a node and updates the childfullhash (2614) of the node's parent, so that the recalculation of the fullhash (410) of the child can be avoided when the system is requesting a child block from the buffer pool.

Messages

FIG. 28 shows a message format. A message can be one of several types, including but not limited to the following:

An insert (2801) message is a structure comprising

1. message_type (2809);

2. transaction_id (2812), an XID;

3. key (2810), a key; and

4. value (2811), a value.

A delete_any (2803) message is a structure comprising

1. message_type (2809);

2. transaction_id (2812), an XID; and

3. key (2810), a key.

A delete_both (2804) message is a structure comprising

1. message_type (2809);

2. transaction_id (2812), an XID;

3. key (2810), a key; and

4. value (2811), a value.

A commit_any (2805) message is a structure comprising

1. message_type (2809);

2. transaction_id (2812), an XID; and

3. key (2810), a key.

A commit_both (2806) message is a structure comprising

1. message_type (2809);

2. transaction_id (2812), an XID;

3. key (2810), a key; and

4. value (2811), a value.

An abort_any (2807) message is a structure comprising

1. message_type (2809);

2. transaction_id (2812), an XID; and

3. key (2810), a key.

An abort_both (2808) message is a structure comprising

1. message_type (2809);

2. transaction_id (2812), an XID;

3. key (2810), a key; and

4. value (2811), a value.

An abort_any_any (2802) is a structure comprising

1. message_type (2809); and

2. transaction_id (2812), an XID.

Each message is encoded into a block of RAM. The message_type (2809) discriminates between the various types of messages, for example, between a commit_any message and an abort_both message. The message format in RAM is organized so that the message_type (2809) is at the same offset in every message, so that the system can, given a block of memory containing an encoded message, determine which message type the message is, and can then determine the offset of each of the other fields. The message_type (2809) is 1 for an insert, 2 for a delete_any, 3 for a delete_both, and so forth.

A message is encoded into a block of memory by encoding each of its fields, one after the other. Thus the first byte of memory contains the message_type (2809). The XID, which is a 64-bit number, is stored in the next 8 bytes. The key is then stored using 4 bytes to store the length of the key, followed by the bytes of the key. The value, if present, is then stored using 4 bytes to store the length of the value, followed by the bytes of the value. Integers are stored in network order (most significant byte first).
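A minimal C sketch of this serialization for an insert message; the function name is hypothetical, and the caller is assumed to supply a buffer large enough for the 17+keylen+vallen bytes:

```c
#include <stdint.h>
#include <string.h>

static size_t put_be32(uint8_t *p, uint32_t x) {   /* network order */
    p[0] = (uint8_t)(x >> 24); p[1] = (uint8_t)(x >> 16);
    p[2] = (uint8_t)(x >> 8);  p[3] = (uint8_t)x;
    return 4;
}

/* Encode an insert message: 1-byte message_type (2809), 8-byte XID,
 * 4-byte key length + key bytes, 4-byte value length + value bytes.
 * Returns the number of bytes written. */
size_t serialize_insert(uint8_t *buf, uint64_t xid,
                        const void *key, uint32_t keylen,
                        const void *val, uint32_t vallen) {
    size_t off = 0;
    buf[off++] = 1;                      /* message_type: 1 = insert */
    for (int i = 7; i >= 0; i--)         /* XID, most significant byte first */
        buf[off++] = (uint8_t)(xid >> (8 * i));
    off += put_be32(buf + off, keylen);
    memcpy(buf + off, key, keylen); off += keylen;
    off += put_be32(buf + off, vallen);
    memcpy(buf + off, val, vallen); off += vallen;
    return off;
}
```

Calling serialize_insert(buf, 1042, "abc", 3, "wxyz", 4) reproduces the 24-byte encoding shown in FIG. 29 below.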

FIG. 29 shows an example encoding (2901) of an insert message (2801) using this encoding. The key is “abc”, which is of length three, and the value is “wxyz”, which is of length four. The XID is 1042. The values in the figure are written in hexadecimal notation.

The message encoding example (2901) can be read as follows:

-   01: The message_type (2809).
-   00 00 00 00 00 00 04 12: The transaction_id (2812), which is 1042
    encoded in hexadecimal.
-   00 00 00 03: The length of the key, which is 3.
-   61 62 63: The three bytes of the key. Note that the letter ‘a’
    encodes as hexadecimal 61 in ASCII.
-   00 00 00 04: The length of the value, which is 4.
-   77 78 79 7A: The four bytes of the value.

When a data structure, including but not limited to a message, has been converted into an array of bytes we say the data structure has been serialized. In many other cases throughout this patent, when we describe a data structure as being serialized, we use a technique similar to the one shown here for message serialization.

The system identifies nodes by a block number. The system converts a block number to a file offset (or a disk offset) and length via a block translation table. The file offset and length together are called a segment.

Alternatively, for some message types, the system could combine messages at nonleaf nodes. For example, if two insert messages with the same key, value, and XID are found, then only one needs to be kept.

Alternatively, there are other types of operations that can be stored as messages. For example, one could implement a lazy query, in which the query is allowed to be returned with a long delay. Alternatively, one could implement an insertion of a key-value pair (k, v) that is subject to different overwrite rules, that is, different rules about when to overwrite a key-value pair (k, v′) that was already in the dictionary. One could implement an update operation U(k,v,c), in which c is a call-back function that specifies how the value v is combined with the existing value of key k in the database. For example, this update mechanism can be used to implement counter-increment functionality. There can also be additional types of operations for the case when duplicates are allowed.

Block Translation Table

FIG. 30 shows a block translation table (BTT) (3001). Block numbers comprise 64-bit values. The block translation table (3001) is a structure comprising

-   -   1. free_blocks (3002), a block number which implements the head
        of a free list;
    -   2. unused_blocks (3003), a block number which indicates that all
        block numbers larger than the unused_blocks (3003) value are
        free;
    -   3. xlated_blocknum_limit (3004), a number which indicates the
        number of block numbers that are translated;
    -   4. block_translation (3005), a pointer to a block translation
        array (3009) stored in RAM;
    -   5. xlation_size (3006), a number which indicates the size of the
        block translation array (3009) as stored on disk;
    -   6. xlation_offset (3007), a number which indicates where on disk
        the block translation array (3009) is stored (thus the
        xlation_offset (3007) and xlation_size (3006) together identify
        a segment); and
    -   7. block_allocator_pointer (3008), a pointer that points to a
        segment allocator (3201).

The block translation array (3009) comprises an array of block
translation pairs.

FIG. 31 shows a block translation pair (3101), which is a structure comprising

1. offset (3102), a number which indicates the offset of the block on disk; and

2. size (3103), a number which indicates the size of the block.

A block translation pair (3101) thus contains enough information to identify a segment on disk.
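Declared in C, the two structures might look as follows; this is a hypothetical sketch mirroring the fields of FIG. 30 and FIG. 31, not code from the patent:

```c
#include <stdint.h>

struct block_allocator;   /* the segment allocator (3201), described below */

struct block_translation_pair {    /* 3101: identifies a segment on disk */
    int64_t offset;                /* 3102: disk offset, or next free block */
    int64_t size;                  /* 3103: block size; negative means free */
};

struct block_translation_table {             /* 3001 */
    int64_t free_blocks;                     /* 3002: head of the free list */
    int64_t unused_blocks;                   /* 3003: first never-used number */
    int64_t xlated_blocknum_limit;           /* 3004: translated block count */
    struct block_translation_pair *block_translation;  /* 3005: in-RAM array */
    int64_t xlation_size;                    /* 3006: on-disk size of array */
    int64_t xlation_offset;                  /* 3007: on-disk offset of array */
    struct block_allocator *block_allocator_pointer;   /* 3008 */
};
```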

To implement a free list, the free_blocks (3002) in the block translation table (3001) names a free block number. A free block has its size (3103) set to a negative value in its block translation pair (3101), and the identity of the next free block is stored in its offset (3102). The last free block in the chain sets its offset (3102) to a negative value.

To allocate a new block number, the system first checks to see whether free_blocks (3002) identifies a block or has a negative value. If it identifies a block, then the list is popped, setting free_blocks (3002) from the identified block's offset (3102), and using the old value of free_blocks (3002) as the newly allocated block number. If there are no free blocks in the free list, then the block number named by unused_blocks (3003) is used, and unused_blocks (3003) is incremented. If unused_blocks (3003) is larger than xlated_blocknum_limit (3004), then the block translation array (3009) is grown by allocating a new array that is twice as big as xlated_blocknum_limit (3004), copying the old array into the new array, freeing the old array, and storing the new array into block_translation (3005).

To free a block number, the block is pushed onto the block free list by setting the block's offset (3102) to the current value of free_blocks (3002), and setting free_blocks (3002) to the block number being freed.
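The free-list pop and push described in the last two paragraphs might be sketched as follows, reusing the structures above (function names hypothetical; malloc error handling elided):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate a block number: pop the free list if it is nonempty; otherwise
 * take the next unused number, doubling the translation array if needed. */
int64_t allocate_blocknum(struct block_translation_table *t) {
    if (t->free_blocks >= 0) {                 /* free list is nonempty */
        int64_t b = t->free_blocks;
        t->free_blocks = t->block_translation[b].offset;   /* next free */
        return b;
    }
    int64_t b = t->unused_blocks++;
    if (t->unused_blocks > t->xlated_blocknum_limit) {
        int64_t newlimit = 2 * t->xlated_blocknum_limit;
        struct block_translation_pair *newarr =
            malloc(newlimit * sizeof *newarr);
        memcpy(newarr, t->block_translation,
               t->xlated_blocknum_limit * sizeof *newarr);
        free(t->block_translation);
        t->block_translation = newarr;
        t->xlated_blocknum_limit = newlimit;
    }
    return b;
}

/* Free a block number by pushing it onto the free list. */
void free_blocknum(struct block_translation_table *t, int64_t b) {
    t->block_translation[b].size   = -1;               /* mark block free */
    t->block_translation[b].offset = t->free_blocks;   /* link to old head */
    t->free_blocks = b;
}
```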

When the block translation array (3009) is written to disk, a segment is allocated using the segment allocator (3201), and the block is written. The size of the segment is stored in xlation_size (3006), and the offset of the segment is stored in xlation_offset (3007).

Alternatively, other implementations of a set of free blocks can be used. For example, the set of free blocks could be stored in a hash table. Similarly, the translation array could be represented differently, for example in a hash table.

The system implements a segment allocator (3201), which manages the allocation of segments in a file. A segment allocator (3201) is a structure comprising

-   -   1. ba_align (3202), a number which indicates the alignment of
        all segments;
    -   2. ba_nsegments (3203), a number which indicates how many
        allocated segments are on disk;
    -   3. ba_arraysize (3204), a number which indicates the size of the
        array pointed to by ba_arrayptr (3205);
    -   4. ba_arrayptr (3205), a pointer which points at an array of
        segment pairs; and
    -   5. ba_nextfit (3206), a number which remembers where in the
        array of segment pairs the system last looked for an allocated
        segment.

The system ensures that every segment's offset is a multiple of
ba_align (3202). Whenever the array of segment pairs is discovered to be
too small (based on ba_arraysize (3204)), the system doubles the size of
the array by doubling ba_arraysize (3204), allocating a new array,
copying the data from the old array to the new array, and freeing the
old array. The array of segment pairs is kept in sorted order, sorted by
the segment offset. A segment pair comprises an offset and a size.

To find a new segment of size S, the system rounds S up to a multiple of ba_align (3202); that is, the system uses ba_align·⌈S/ba_align⌉. The system then looks at the segment pair identified by ba_nextfit (3206). The system determines the size of the unused space between the segment named in that segment pair and the segment named in the next segment pair. If the unused space is of size S or larger, then all the segment pairs from that point are moved up in the array by one element, creating a new segment pair. The new segment is then initialized with size S and offset at the end of the segment named in the original segment pair. If the unused space is smaller than S (possibly there is no space at all), then ba_nextfit (3206) is incremented, wrapping around at the end of the array, and the system looks again. If the system makes one complete round looking at all the free slots without finding a large enough free segment, then the system allocates space at the end of the file.
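A sketch of this next-fit search, under the simplifying assumptions that the array of in-use segment pairs is sorted by offset and that all offsets and sizes are already multiples of ba_align (3202); insertion of the new pair into the array is elided, and the names are hypothetical:

```c
#include <stdint.h>

struct segment_pair { int64_t offset, size; };

static int64_t round_up(int64_t s, int64_t align) {
    return align * ((s + align - 1) / align);  /* ba_align * ceil(S/ba_align) */
}

/* Next-fit: return the offset of a gap that can hold a segment of size S,
 * or -1 after one complete round, meaning the caller should allocate at
 * the end of the file. *nextfit plays the role of ba_nextfit (3206). */
int64_t find_gap(const struct segment_pair *a, int64_t n, int64_t *nextfit,
                 int64_t S, int64_t align) {
    S = round_up(S, align);
    for (int64_t tries = 0; tries < n; tries++) {
        int64_t i = *nextfit;
        int64_t gap_start = a[i].offset + a[i].size;
        /* The gap ends at the next segment; after the last segment there
         * is no bounded gap, so a full round falls through to the end. */
        int64_t gap_end = (i + 1 < n) ? a[i + 1].offset : gap_start;
        if (gap_end - gap_start >= S)
            return gap_start;           /* carve the new segment here */
        *nextfit = (i + 1) % n;         /* wrap around and keep looking */
    }
    return -1;
}
```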

The system does not allocate a segment that has an offset between 0 and ba_reserve (3207), reserving that space for file header information (including but not limited to information about where the block translation table is stored on disk).

In the segment allocator (3201) described above, the free space is stored implicitly by storing the in-use segments in sorted order. In some situations the system stores the free segments explicitly in an OMT sorted in increasing order by the size of the free segment. In this mode, the system allocates a segment of size S by performing a search to find the smallest free segment of size greater than or equal to S. The found segment is removed from the OMT. If the found segment's size is equal to S then that segment is used. If the found segment is larger than S then the system breaks the segment into two parts, one of size S which is used, and the other which is the remaining unused space. The unused segment is stored in the OMT.

When a node with block number b is written to disk, it is first serialized into a string of bytes of length U, then it is compressed, producing another string of bytes of length C. Then the 4-byte encodings of C and U are prepended to the compressed string, yielding a string of length C+8. Then a segment of size D=C+8 is allocated and recorded in the block translation table, which maps block number b to the segment along with the length of the segment (C+8). Then the sequence is written to disk at the segment.
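The write-side framing might be sketched as follows. The patent does not name a compressor; zlib is used here purely for illustration, and the function name is hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>
#include <zlib.h>

/* Compress a serialized node of U bytes and prepend the 4-byte
 * network-order encodings of C and U, yielding the C+8 bytes that are
 * written to the allocated segment. Returns NULL on failure. */
uint8_t *frame_node(const uint8_t *serialized, uint32_t U, uint32_t *D_out) {
    uLongf C = compressBound(U);
    uint8_t *out = malloc(8 + C);
    if (out == NULL) return NULL;
    if (compress(out + 8, &C, serialized, U) != Z_OK) {
        free(out);
        return NULL;
    }
    for (int i = 0; i < 4; i++) {             /* bytes 0-3: C, bytes 4-7: U */
        out[i]     = (uint8_t)((uint32_t)C >> (8 * (3 - i)));
        out[4 + i] = (uint8_t)(U >> (8 * (3 - i)));
    }
    *D_out = (uint32_t)C + 8;                 /* segment size D = C + 8 */
    return out;
}
```

The read path, described next, inverts this framing: bytes 0-3 give C, bytes 4-7 give U, and the decompressor is handed the remaining C bytes along with a U-sized output buffer.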

To read a block from disk into RAM, the system consults the block translation table to determine the segment on disk holding the compressed data. The length, D, of the compressed block with the prepended lengths is also retrieved from the block translation table. A block of RAM of size D is allocated, and the data is read from the segment on disk into the RAM block. Then the size, U, of the uncompressed block is obtained from Bytes 4-7 of the retrieved block. Then a block of size U is allocated in RAM. The D-sized RAM block is decompressed into the U-sized RAM block. Then the D-sized RAM block is freed. The U-sized RAM block is then decoded into an in-RAM data structure. For leaf nodes, which have a memory pool, the U-sized block is used for the memory block (1006) of the memory pool.

For each dictionary, the system maintains a block translation table (BTT). In some modes of operation, the system maintains a checkpointed block translation table (CBTT), and in some modes of operation the system maintains a temporary block translation table (TBTT).

Pushing Messages

The system composes messages and then executes them on the root node of a dictionary. Executing a message on a node of the dictionary may result in the message being incorporated into the node, or in other actions being taken, as follows.

To execute a sequence of messages on a nonleaf node N that is in RAM:

1. For each message in the sequence:

-   -   (a) The system examines the pivotkeys (2610) to determine the
        child or children to which a copy of the message shall be sent
        (a C sketch of this child-selection rule appears after this
        procedure).
        -   i. Insert messages (2801) are sent to the ith subtree where
            i is the smallest number such that the ith pivot key is
            greater-than-or-equal to the message (note that for DUP
            trees comparing the pivot key involves comparing both the
            key and the value of the message). If all the pivot keys are
            less than the message, then the message is sent to the
            rightmost subtree.
        -   ii. Any of the “_both” messages (delete_both (2804),
            commit_both (2806), and abort_both (2808)) are sent to the
            same subtree that an insert message would be sent to.
        -   iii. In the case of an “_any” message (delete_any (2803),
            commit_any (2805), and abort_any (2807)) for NODUP trees,
            the message is sent to the same subtree that an insert
            message would be sent to.
        -   iv. In the case of an “_any” message (delete_any (2803),
            commit_any (2805), and abort_any (2807)) for DUP trees, the
            message may be sent to more than one subtree. The subtrees
            include any subtree i for which the key part of the ith
            pivot is greater-than-or-equal to the key of the message and
            the key part of the i−1st pivot key is less than or equal to
            the key of the message. The message is sent to the leftmost
            subtree if the first pivot key is greater than or equal to
            the message, and is sent to the last subtree if the last
            pivot key is less than or equal to the message.
        -   v. For an abort_any_any (2802), the message is sent to all
            the subtrees.
    -   (b) The system copies the message into the respective message
        buffers of all the children identified in Step 1a.

2. For each Child C of Node N:

-   -   (a) If the node of Child C is in RAM and is dirty and is not
        temporarily locked or otherwise inaccessible, then
        -   i. Let B be the buffer in N corresponding to the child.
        -   ii. While B is not empty and the oldest message in B can fit
            into the child without exceeding the child's target size
            (even when the message is replicated many times in Step
            1b):
            -   A. Dequeue the oldest message from B's FIFO.
            -   B. Construct a sequence of length one containing that
                message.
            -   C. Execute the sequence on the child of the node.

3. If node N is larger than its target size:

-   -   (a) Find the child with the largest value of nbytesinbuf (2616)
        (which corresponds to the buffer with the most bytes in its
        FIFO). (If all the child FIFOs are empty, then the system is
        finished with N.)
    -   (b) Let B be the buffer of that child.
    -   (c) Construct a sequence of messages by dequeueing some
        (possibly all) of the messages in B's FIFO, the first element of
        the sequence being the oldest message in the FIFO, and the last
        element of the sequence being the newest message in the FIFO.
        (Note: the FIFO is now empty.)
    -   (d) Bring the node of the child into RAM if it is not in RAM.
    -   (e) Execute the sequence of messages on the child node.
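For concreteness, the following C sketch illustrates two of the steps above: the child-selection rule of Step 1a (for insert messages in a NODUP tree) and the choice of the fullest buffer in Step 3a. The struct and field names are hypothetical stand-ins for the structures described earlier (pivotkeys (2610), nbytesinbuf (2616)), and the key comparator is passed in as a callback:

```c
#include <stddef.h>
#include <stdint.h>

struct pivot { const void *key; size_t keylen; };         /* hypothetical */
struct child_info { uint32_t nbytesinbuf; /* 2616; fields trimmed */ };

/* Step 1a(i): an insert goes to the first subtree whose pivot key is
 * greater-than-or-equal to the message key, or to the rightmost subtree
 * if every pivot key is smaller. */
int route_insert(const void *key, size_t keylen,
                 const struct pivot *pivots, int npivots,
                 int (*cmp)(const void *, size_t, const void *, size_t)) {
    for (int i = 0; i < npivots; i++)
        if (cmp(pivots[i].key, pivots[i].keylen, key, keylen) >= 0)
            return i;       /* pivot i >= message: send to subtree i */
    return npivots;         /* all pivots smaller: rightmost subtree */
}

/* Step 3a: pick the child whose FIFO holds the most bytes; return -1
 * when every buffer is empty and no flushing is needed. */
int fullest_child(const struct child_info *ci, int nchildren) {
    int best = -1;
    uint32_t most = 0;
    for (int i = 0; i < nchildren; i++)
        if (ci[i].nbytesinbuf > most) { most = ci[i].nbytesinbuf; best = i; }
    return best;
}
```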

Alternatively, variants of these rules can be employed. For example, in Step 2a the system could ignore whether the child is dirty or temporarily inaccessible. Another example is that the system could, whenever it finds a nonempty buffer B in a node where the corresponding child is dirty, unlocked, and in memory, remove all the messages from B and execute them on the child.

Emptying a buffer in a node by moving messages to the child is called flushing the buffer.

It is possible that a node will be larger than its target size after executing a sequence of messages on the node. For example, an abort_any message may be replicated many times in Step 1b. Then in Step 3, the buffer of only one child is emptied. The node could still be larger than its target size, which is acceptable, because the system can empty additional buffers in future operations.

FIG. 7 depicts an example showing part of a tree that comprises two levels of nodes containing messages in buffers. Call the node highest in the tree u (701). The pivots in the node are drawn as circles (706). The buffers in the node are drawn as rectangles (709). Some of the buffers are indicated by an ellipsis (710). Denote the first and second of u's children as v (702) and w (703), respectively. Denote the penultimate and ultimate of u's children as x (704) and y (705). The child pointer from u's parent to u (707) and the child pointers from u to v, w, x, and y (708) are indicated by downward arrows. The messages in the buffers are drawn as diamonds (712).

FIG. 8 depicts the same part of the tree as shown in FIG. 7, after messages have been moved down the tree. The penultimate buffer (806) in node u (801) has some of its messages removed and placed in the children. The buffers associated with the other children v (802), w (803), y (805) are not modified in this example. One element (811) remains in the buffer after the flush. The rest of the messages previously in that buffer are flushed to buffers in the associated child x (804). Two messages (812) are inserted into the first buffer (807), one message (813) is inserted into the second buffer (808), one message (814) is inserted into the penultimate buffer (809), and one message (815) is inserted into the last buffer (810). Alternatively, it is also possible to implement buffer flushes that empty buffers completely, or flushes after which some messages remain.

Alternatively, there are other ways to accomplish the movement of messages to the children of a node. For example, it is not necessary to actually construct the sequence of messages. Instead one could dequeue one message at a time and insert it into the child node.

Alternatively, there are many ways to implement the movement of messages in a data structure in which messages move opportunistically into nodes that are in RAM, but are sometimes delayed if the destination node is not in RAM. For example, the system could use part of main memory to store a balanced search tree or other dictionary. Most of the time, the balanced search tree remains in RAM. At each of the leaves of the dictionary is a reference to another dictionary. When a message is inserted, the balanced search tree sends the message to the appropriate dictionary. That is, when a message is inserted, the balanced search tree in RAM is used to partition the search space. Then, the message is inserted directly into a dictionary. In one mode the system does not use a tree-based structure in the leaves but instead uses a cache-oblivious lookahead array (COLA).

FIG. 12 illustrates an example of a two-level system. The system employs a dictionary near the root (1201), including but not limited to a balanced search tree, which is stored in RAM. All messages that are inserted travel through the dictionary to the leaves. At each leaf, there is a dictionary (1202), and the message is inserted into one of these dictionaries.

Alternatively, a system could move only some of the messages to the destination. For example, if the destination fills up, the system could delay sending additional messages to the destination until some future time when the destination has forwarded its messages onward.

The process by which messages move directly down the tree without being stored in intermediate buffers is referred to as aggressive promotion.

Alternatively, a system can implement aggressive promotion that is adaptive, even when the particular data structure is not tree-based. For example, a COLA can implement aggressive promotion as follows: rather than putting the message directly in the lowest-capacity level of the COLA, put the message (in the appropriate rank location) in the deepest level of the COLA that is still in RAM and where space can be made. Thus, the system could use a packed-memory array to make space in the levels. The system could also use a modified packed-memory array where rebalance intervals are chosen adaptively to avoid additional memory transfers.

FIG. 13 illustrates such a scheme in a non-tree-based streaming dictionary. The dictionary contains levels of geometrically increasing size. There is a first level (1301) that contains one array position, a second level (1302) that contains two array positions, a third level (1303) that contains four array positions, a fourth level (1304) that contains eight array positions, a fifth level (1305) that contains 16 array positions, and a sixth level (1306) that contains 32 array positions. The hatched array cells are paged out to disk, and the nonhatched array cells are paged into memory. The sinuous arrow (1307) from top to bottom identifies the search path for a message. In aggressive promotion, the message is inserted into a deep level that is paged into memory, one or more steps down the search path. In order to make room for new messages inserted into the arrays, packed-memory arrays or adaptive packed-memory arrays are used. Thus, each array stores messages in sorted order with a linear number of gaps among the messages. There are lookahead pointers which can help to reduce the search cost per level to O(1).

In this picture, the new message is inserted directly into the fifth level, with 16 array positions. In order to make room for the message, there is a rebalance, as indicated by a rebalance interval (1308). The rebalance interval is chosen so that it only involves array cells that are paged into memory. If such a rebalance interval had not been found on one level, then the element would be inserted into a higher level.

Alternatively, this structure can be modified to support messages with different lengths. For example, one could use a PMAVSE (which is described below). The structure can be modified so that the ratio between different levels is different from 2. Moreover, one could use a different structure from a PMA at each of the levels.

Alternatively, the paging scheme might depend on how messages move through the data structure. For example, the system may choose to preemptively bring into RAM a part of the data structure that is the destination of messages.

When a key-value pair is inserted into a dictionary, the system constructs an insert message (2801) containing the XID of the transaction inserting the pair, and the key and the value. Then a sequence of length one is created containing that message.

1. If the root node of the tree is not in RAM, then the system brings it into RAM from disk.

2. The sequence is then executed on the node.

Alternatively, one can process the messages differently. For example, for each leaf, the system could maintain a hash table, indexed by XID, of all transactions which are provisional. Then, when an abort_any_any (2802) arrives at a leaf, the system could operate only on those leaf entries that mention the XID. Similarly, the system could maintain, for each nonleaf node, a hash table of all the uncompleted transactions in the subtree, so that an abort_any_any message would only need to be sent to certain subtrees. Alternatively, instead of using a hash table, the system could use another data structure, such as a Bloom filter, which would indicate definitively that a particular subtree does not contain messages or leaf entries for a particular transaction.

Messages on Leaves

To execute an insert message with XID x, key k, and value v on a leaf, the system looks up, in the OMT (1101) of the leaf, the leaf entry whose key equals the key of the message (that is, the message key matches the leaf entry key) (for NODUP dictionaries) or matches both the key and the value (for DUP dictionaries). The system then proceeds as follows; a C sketch of these cases appears after the list.

-   -   1. If there is no matching leaf entry, then a LE_PROVVAL leaf
        entry is inserted into the OMT (1101) with key k, XID x, and
        value v.
    -   2. If there is a LE_COMMITTED leaf entry with key k′ and
        committed value c, then that leaf entry is replaced by a new
        LE_BOTH leaf entry with key k, XID x, committed value c, and
        provisional value v.
    -   3. If there is a LE_BOTH leaf entry with key k′, XID x′,
        committed value c, and provisional value p, then the system does
        the following:
        -   (a) If x=x′ then replace the leaf entry with a new LE_BOTH
            leaf entry with key k, XID x, committed value c, and
            provisional value v.
        -   (b) Otherwise replace the leaf entry with a new LE_BOTH leaf
            entry with key k, XID x, committed value p, and provisional
            value v.
    -   4. If there is a LE_PROVDEL leaf entry with key k′, XID x′, and
        committed value c, then the system does the following:
        -   (a) If x=x′ then replace the leaf entry with a new LE_BOTH
            leaf entry with key k, XID x, committed value c, and
            provisional value v.
        -   (b) Otherwise replace the leaf entry with a new LE_PROVVAL
            leaf entry with key k, XID x, and provisional value v.
    -   5. If there is a LE_PROVVAL leaf entry with key k′, XID x′, and
        provisional value p, then the system does the following:
        -   (a) If x=x′ then replace the leaf entry with a new
            LE_PROVVAL leaf entry with key k, XID x, and provisional
            value v.
        -   (b) Otherwise replace the leaf entry with a new LE_BOTH leaf
            entry with key k, XID x, committed value p, and provisional
            value v.
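The five cases above amount to a small state machine over the leaf-entry types. A compact C sketch, with hypothetical struct and function names (case 1, the no-match case, is handled by the OMT insertion and omitted here):

```c
#include <stddef.h>
#include <stdint.h>

struct val;   /* opaque value handle, hypothetical */

enum le_type { LE_COMMITTED, LE_BOTH, LE_PROVDEL, LE_PROVVAL };

struct leaf_entry {
    enum le_type type;
    uint64_t xid;              /* XID of the provisional transaction */
    struct val *committed;     /* committed value, if any */
    struct val *provisional;   /* provisional value, if any */
};

/* Apply an insert with XID x and value v to a matching leaf entry,
 * following cases 2-5 above. */
void le_apply_insert(struct leaf_entry *le, uint64_t x, struct val *v) {
    switch (le->type) {
    case LE_COMMITTED:             /* case 2: committed c -> BOTH(c, v) */
        le->type = LE_BOTH;
        break;
    case LE_BOTH:                  /* case 3 */
        if (le->xid != x)          /* other XID: p is implicitly committed */
            le->committed = le->provisional;
        break;
    case LE_PROVDEL:               /* case 4 */
        if (le->xid == x) {
            le->type = LE_BOTH;    /* keep committed value c */
        } else {                   /* the delete is implicitly committed */
            le->type = LE_PROVVAL;
            le->committed = NULL;
        }
        break;
    case LE_PROVVAL:               /* case 5 */
        if (le->xid != x) {        /* other XID: p is implicitly committed */
            le->type = LE_BOTH;
            le->committed = le->provisional;
        }
        break;
    }
    le->xid = x;                   /* v always becomes the provisional value */
    le->provisional = v;
}
```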

To execute on an OMT a delete_any (2803) with XID x and key k, for each leaf entry in the OMT that has a key matching k the system does the following:

-   -   1. If the leaf entry is a LE_COMMITTED leaf entry with key k and
        committed value c, then replace the leaf entry with a new
        LE_PROVDEL leaf entry with key k, XID x, and committed value c.
    -   2. If the leaf entry is a LE_BOTH leaf entry with key k, XID x′,
        committed value c, and provisional value p, then the system does
        the following:
        -   (a) If x=x′ then replace the leaf entry with a LE_PROVDEL
            leaf entry with key k, XID x, and committed value c.
        -   (b) Otherwise replace the leaf entry with a LE_PROVDEL leaf
            entry with key k, XID x, and committed value p.
    -   3. If the leaf entry is a LE_PROVDEL leaf entry with key k, XID
        x′, and committed value c, then the system does the following:
        -   (a) If x=x′ then replace the leaf entry with a LE_PROVDEL
            leaf entry with key k, XID x, and committed value c.
        -   (b) Otherwise remove the leaf entry without replacing it.
    -   4. If the leaf entry is a LE_PROVVAL leaf entry with key k, XID
        x′, and provisional value p, then the system does the following:
        -   (a) If x=x′ then remove the leaf entry without replacing it.
        -   (b) Otherwise replace the leaf entry with a LE_PROVDEL leaf
            entry with key k, XID x, and committed value p.

To execute on an OMT a delete_both (2804), the system finds all leaf entries that match both the key and the value of the message, and for each such leaf entry performs the steps specified in the previous paragraph, just as if the message were a delete_any (2803).

To execute on an OMT a commit_any (2805) with XID x and key k, for each leaf entry in the OMT that has a key matching k the system does the following:

-   -   1. If the leaf entry is a LE_COMMITTED leaf entry with key k and
        committed value c, then replace the leaf entry with a new
        LE_COMMITTED leaf entry with key k, XID x, and committed value
        c.
    -   2. If the leaf entry is a LE_BOTH leaf entry with key k, XID x′,
        committed value c, and provisional value p, then replace the
        leaf entry with a LE_COMMITTED leaf entry with key k, XID x, and
        committed value p.
    -   3. If the leaf entry is a LE_PROVDEL leaf entry with key k, XID
        x′, and committed value c, then remove the leaf entry without
        replacing it.
    -   4. If the leaf entry is a LE_PROVVAL leaf entry with key k, XID
        x′, and provisional value p, then replace the leaf entry with a
        LE_COMMITTED leaf entry with key k and committed value p.

To execute on an OMT a commit_both (2806), the system finds all leaf entries that match both the key and the value of the message, and for each such leaf entry performs the steps specified in the previous paragraph, just as if the message were a commit_any (2805).

To execute on an OMT an abort_any (2807) message with XID x and key k, for each leaf entry in the OMT that has a key matching k, the system does the following:

-   -   1. If the leaf entry is a LE_COMMITTED leaf entry with key k and
        committed value c, then replace the leaf entry with a new
        LE_COMMITTED leaf entry with key k, XID x, and committed value
        c.
    -   2. If the leaf entry is a LE_BOTH leaf entry with key k, XID x′,
        committed value c, and provisional value p, then
        -   (a) if x=x′ then replace the leaf entry with a LE_COMMITTED
            leaf entry with key k and committed value c.
        -   (b) otherwise replace the leaf entry with a LE_COMMITTED
            leaf entry with key k and committed value p.
    -   3. If the leaf entry is a LE_PROVDEL leaf entry with key k, XID
        x′, and committed value c, then replace the leaf entry with a
        LE_COMMITTED leaf entry with key k and committed value c.
    -   4. If the leaf entry is a LE_PROVVAL leaf entry with key k, XID
        x′, and provisional value p, then remove the leaf entry without
        replacing it.

To execute on an OMT an abort_both (2808) message, the system finds all leaf entries that match both the key and the value of the message, and for each such leaf entry performs the steps specified in the previous paragraph, just as if the message were an abort_any (2807).

To execute on an OMT an abort_any_any (2802) message, the system finds all the leaf entries that have provisional states that match the XID of the message, and transforms those as if an abort_any (2807) were executed. For example:

-   -   1. For LE_COMMITTED leaf entries, no change is made.
    -   2. For LE_BOTH leaf entries, if the XID of the leaf entry
        matches the message, then replace the leaf entry with a
        LE_COMMITTED leaf entry using the previously committed value. If
        the XIDs do not match, then no change is made.
    -   3. For LE_PROVDEL leaf entries, if the XID of the leaf entry
        matches, then replace the leaf entry with a LE_COMMITTED leaf
        entry using the previous committed value from the leaf entry. If
        the XIDs do not match, then no change is made.
    -   4. For LE_PROVVAL leaf entries, if the XID of the leaf entry
        matches, then delete the leaf entry; otherwise no change is
        made.

In all the cases above, when a leaf entry is created, its checksum is also computed.

In some conditions, when a leaf entry is queried, the system can change its state. For example, the system maintains a list of all pending transactions. If a leaf entry is being queried, then all of the messages destined for that leaf entry have been executed. If the leaf entry reflects a provisional state for a transaction that is no longer pending, then the system can infer that the transaction committed (because otherwise an abort message would have arrived), and so the system can execute an implicit commit message.

The system maintains in each node statistical or summary information for the subtree rooted at the node. FIG. 33 shows a data structure that can maintain such summary statistics. The statistics (414) structure comprises the following elements:

-   -   1. a number ndata (3301) representing an estimate of the number
        of key-value pairs in the subtree rooted at the node,
    -   2. a number ndata_error_bound (3302) bounding the estimate error
        for ndata (3301),
    -   3. a number nkeys (3303) representing an estimate of the number
        of distinct keys in the subtree,
    -   4. a number nkeys_error_bound (3304) representing the estimate
        error for nkeys (3303),
    -   5. a key or key-value pair (depending on whether the tree is a
        NODUP or DUP tree, respectively), minkey (3305), representing an
        estimate of the least pair in the subtree,
    -   6. a key or key-value pair (depending on whether the tree is a
        NODUP or DUP tree, respectively), maxkey (3306), representing an
        estimate of the greatest pair in the subtree, and
    -   7. a number dsize (3307) representing an estimate of the sum of
        the lengths of the leaf entries.

In a leaf node, the system can maintain a count of the number of leaf entries in the ndata (3301) field. If the system quiesces, and all transactions are committed or aborted, then this count is the number of rows in the node. If the system is not quiescent or some transactions are pending, then the count can be viewed as an estimate of the number of entries in the dictionary. The difference between the estimate and the quiescent value is called the estimate error, and the estimate error cannot be determined until the system quiesces and the relevant transactions are completed. Every time a leaf entry is added, the count is incremented, and every time a leaf entry is removed, the count is decremented.

The system maintains in each leaf node a count ndata_error_bound (3302) bounding the estimate error for ndata (3301):

-   -   1. Each LE_COMMITTED leaf entry contributes zero to the bound,
    -   2. each LE_BOTH leaf entry contributes zero to the bound
        (because whether the transaction aborts or commits, the count
        will not change),
    -   3. each LE_PROVDEL leaf entry contributes one to the bound
        (because ndata (3301) is counting the leaf entry, but if the
        appropriate transaction commits, the leaf entry will be
        removed), and
    -   4. each LE_PROVVAL leaf entry contributes one to the bound
        (because if the appropriate transaction aborts, the leaf entry
        will be removed).

For nonleaf nodes, the system maintains the ndata (3301) field as the sum of the ndata (3301) fields of its children. The system maintains the ndata_error_bound (3302) as the sum of the ndata_error_bound (3302) fields of its children, plus the number of messages in the buffers of the node. If any of the entries are delete_any messages, then the ndata_error_bound (3302) is set to ndata (3301) of the node, since in some cases all the leaf entries could be deleted by those messages. Alternatively, tighter bounds for ndata_error_bound (3302) can be used. For example, a delete_any message can only delete one key, so if there are many unique keys, then the ndata_error_bound (3302) can sometimes be reduced.

Similarly, the system can maintain a count of the number of unique keys nkeys (3303) in a leaf node, along with correct values for minkey (3305) and maxkey (3306).

For nonleaf nodes, the system can combine results from subtrees to compute nkeys (3303). Given two adjacent subtrees, A and B (A to the left of B), if the maxkey (3306) of A equals the minkey (3305) of B, then the number of distinct keys in A and B together is the number of unique keys in A plus the number of unique keys in B minus one. If the maxkey (3306) of A is not equal to the minkey (3305) of B, then the number of unique keys in A and B together is the sum of the number of unique keys in A and B. Thus, by combining all the results from the children, the proper value of nkeys (3303) can be computed. The nkeys_error_bound (3304) can be computed in a way similar to the ndata_error_bound (3302).
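A one-function sketch of this combining rule, treating keys as NUL-terminated strings purely for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Combine distinct-key counts for adjacent subtrees A (left) and B
 * (right). If A's greatest key equals B's least key, the shared boundary
 * key was counted in both subtrees, so one copy is subtracted. */
int64_t combine_nkeys(int64_t nkeys_a, const char *maxkey_a,
                      int64_t nkeys_b, const char *minkey_b) {
    if (strcmp(maxkey_a, minkey_b) == 0)
        return nkeys_a + nkeys_b - 1;
    return nkeys_a + nkeys_b;
}
```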

For the data size estimate dsize (3307), each leaf can keep track of the sum of the sizes of its leaf entries, and a subtree can simply use the sum of its children.

In many cases an estimate of the number of rows or distinct keys or data size in a tree is useful even if the estimate has an error. For example, a query optimizer may not need precise row counts to choose a good query plan. In such a case, the summary statistics at the root of the tree suffice.

In the case where an exact summary statistic is needed, the system can compute the count exactly. To compute exact statistics, or to compute the statistics to within certain error tolerances, as viewed by a particular transaction, the system can perform the following actions:

-   -   1. Check ndata_error_bound (3302) for that subtree. If the error
        bound is tight enough (for example, if it is zero), then ndata
        (3301) is the correct value and can be used. Also, if the
        ndata_error_bound (3302) is zero, then the dsize (3307) is
        exactly right. Similarly, the nkeys_error_bound (3304) can
        determine whether the nkeys (3303) is accurate enough.
    -   2. For any value that has too-loose error bounds, in a leaf
        node, the system iterates through the leaf entries, performing a
        query on the key in each leaf entry. Assuming that the lock tree
        permits the queries to run without detecting a conflict, then
        after all the implicit commits operate, the number of leaf
        entries remaining in the node is the correct number, and the
        estimate error will be zero. (If the lock tree does not permit
        the queries, then there is a conflict, and the transaction must
        wait or abort or try again.)
    -   3. For nonleaf nodes, the system can iterate through the
        children performing this computation recursively. Any subtree
        with an accurate enough estimate can calculate it quickly, and
        otherwise the computation descends the tree, summing up the
        statistics appropriately.

Alternatively, the statistics (414) field is a structure that can be incorporated directly into some other structure, as shown in FIG. 4 or FIG. 26, or it could be incorporated with a pointer, similarly to the memory pool (1001) in FIG. 4.

For each child of a nonleaf node, the system stores a copy of the child's statistics in the subtreestatistics (2612) field of the appropriate child information structure (2605). The system can use those cached values to incrementally recompute the statistics of a node when a child's statistics change.

Alternatively, additional statistical summary information could be added to the statistics (414). For example, if a dictionary comprises rows comprising fields, then the statistics could keep a summary value for some or all of the fields. Examples of such summary values are the minimum value of a field, the maximum value of a field, the sum of the field values, the sum of the squares of the field values (which could, for example, be used to compute the variance and standard deviation of the field values), and the logical “and”, logical “or”, or logical “exclusive or” of fields treated as Booleans or as integers (where the logical operations operate bitwise on the values). The system could also be modified to maintain an estimate of the median value, or percentile values for particular percentile ranks (such as quartiles). A subtree fingerprint calculation can also be viewed as a kind of summary.

Alternatively, the summary information can be maintained incrementally as the tree is updated. For example, each parent's summary can be updated as soon as its child is updated. Alternatively, a parent's summary can be updated in a “lazy” fashion, waiting until the child is written to disk to update the parent. In this alternative case, when performing a query on the statistical summary, the system can walk the in-RAM part of the tree to calculate summary information, optionally updating the summary for the various nodes, and setting a Boolean to remember that the subtree has not been changed since the summary information was calculated.

To implement nested transactions, the system uses a different kind of leaf entry that comprises a stack of XIDs (described in more detail below). In this mode, transactions can be created, committed, and aborted. Given a transaction, operations can be performed within that transaction, including looking up values in the tree, inserting new values into the tree, and creating a child transaction. The child transaction is said to be inside the parent transaction. The system maintains a set of all the open transactions using an instance of an OMT. The set of open transactions can also be held in another data structure, including but not limited to a hash table, using the least significant bits of the XID to select the hash bucket. Alternatively, one can implement implicit commits, and maintain counters such as ndata_error_bound (3302) and ndata (3301), in a system with nested transactions.

Alternatively, one can reduce the number of accesses into the open-transaction set, for example, by employing an optimistic locking scheme. One implementation of such a scheme would be to maintain a global counter that is incremented every time a transaction begins, aborts, or commits. If the counter does not change between two XID lookups, then the result can be assumed to be the same. If the counter does change, then another lookup would be required. Another alternative is to record a pointer directly to the transaction record along with every XID, thus entirely avoiding the lookup. Yet another alternative is to maintain a per-thread cache of recently tested XIDs that are known to be closed.

Nested Transactions

In a mode that implements nested transactions, the system operates as follows. A leaf entry comprises a key and a stack of transaction records. The bottom of the stack represents the outermost transaction to modify the leaf entry, and the top of the stack represents the innermost transaction. Each transaction record comprises an XID and a value. The value in each transaction record is the value for the key if that transaction successfully commits. Each transaction record also comprises some Boolean flags. When a transaction performs an insert, the new value is stored in the transaction record. When a transaction performs a delete, the value is replaced by a delete flag.

In this scheme each message (including but not limited to insert, delete, and abort) contains the XID of the current transaction and also the XIDs of all enclosing transactions.

When a transaction aborts, an abort message is sent to every leaf entry modified by that transaction. When a transaction is committed, no messages are sent.

When a message arrives at a leaf entry, the list of transaction ids in the message is compared with the transaction records in the leaf entry to find the Least Common Ancestor (LCA). Any transactions in the leaf entry newer than the LCA could only be missing from the message if they had committed, so the system can promote the values in those transaction records to a committed state.

FIG. 34 shows an example of insertion with nested transactions (3401). In FIG. 34, Xₙ is an XID, k and j are keys, and vₙ is a value.

FIG. 35 shows the insert messages that are sent into the tree to the leaf entry with key k as a result of these transactions.

Each message contains the XIDs of the current transaction and of all enclosing transactions. For example, transaction X₃ did not directly modify the entry at key k, so there is no message addressed to k with X₃ as its first XID. But the XID for X₃ is included in message (3504) because transaction X₄ is enclosed within X₃.

When these messages arrive at the leaf entry for key k, they are processed as shown in FIG. 36.

With the arrival of message (3501), the message contents are inserted into a new stack. The leaf contents (3601) mean that key k is now associated with the value v₀ and that if transaction X₀ successfully commits, then key k will have value v₀. Because there is no entry before the entry for transaction X₀, if transaction X₀ does not successfully commit then the leaf entry for key k will be destroyed.

After processing message (3502), the leaf entry stack reflects not only the value key k will have if X₁ commits successfully (v₁), but also the value it would have if transaction X₁ aborts but X₀ commits (v₀), as well as the value (none) if both X₁ and X₀ abort.

Upon processing message (3503), the system infers that transaction X₁ committed successfully by going up the list of enclosing transactions in message (3503) and comparing it with the list of enclosing transactions in leaf entry (3602). The system calculates that the LCA is transaction X₀. In the absence of an abort message, this implies that transaction X₁ committed successfully. Since transaction X₁ is now complete, the value that would be saved if X₀ were to commit successfully is now v₁. So v₁ is copied from the stack entry for X₁, overwriting the value previously stored in the stack entry for X₀. This process of moving a value higher in the stack is called promotion.

Upon processing message (3504), two changes are made to the leaf. A new stack entry is created for transaction X₄ with a value of v₄, and a new stack entry is created for transaction X₃. Even though X₃ did not directly modify the value associated with this key, the transaction X₄ enclosed inside X₃ did. This is reflected in the stack of leaf entry (3604).

In this example, after processing message (3504), the stack of transaction records contains the value v₂ twice, once each for X₂ and X₃. The system employs a memory optimization to replace v₂ in the transaction record for X₃ with a placeholder flag, indicating that the value for transaction X₃ is the same as the value in the transaction record below it, in this case X₂. FIG. 37 shows a leaf entry (3701) representing the same information as leaf entry (3604), except that leaf entry (3701) contains a placeholder. Alternatively, the system could employ other representations of the same information, including but not limited to creating a pointer to the same value instead of using a placeholder.

During a query, or lookup of a key, the read lock for the leaf entry is not necessarily taken, since the system tests the lock after the read. If a transaction unrelated to the transaction issuing the query is writing to this leaf entry, then that unrelated transaction is open and the system does not implicitly promote the value. So any implicit promotions done during the query can be based solely on whether the transactions with XIDs that are recorded in the leaf entry are still open.

The system operates as follows when performing a lookup. For every transaction in the leaf entry (starting with the innermost and going out), if the transaction is no longer open then implicitly promote it.

Each query is accompanied by a list of XIDs of all the enclosing transactions, similar to the stack of transaction ids that accompanies each insert. The set of transaction ids is passed on the call stack as an argument to the query function, but it could be passed in other ways, for example as a message. While this list may not be sufficient to determine that a given transaction is definitely closed, it can prove that a transaction is still open. This information can be used as a fast test to determine whether a dispositive test is required. If a transaction is definitely open, then it is not necessary to look up its XID in the global set of open transactions.

FIG. 38 shows the same set of transactions as FIG. 34, except with a few queries added (3801).

FIG. 39 shows the state of the leaf after processing each message and query.

The query (3910) inside transaction X₃ is accompanied by the sequence of XIDs X₃, X₂, X₀. When the query is processed, the XIDs in leaf entry (3903) are compared with the set of open transactions. Transaction X₂ is the innermost transaction in the leaf entry, so the system compares it with the list of XIDs accompanying the query message, and sees that X₂ is still open and no further action needs to be taken.

The query (3911) after the close of transaction X₃ is accompanied by the sequence of XIDs X₂, X₀. When the query is processed, the innermost XID X₄ of the leaf entry is compared with the sequence of transactions in the query message. Because X₄ is not in the sequence, it is possible that X₄ has committed, so the system examines the global list of open transactions. Since X₄ was committed, the system promotes the value for X₄ by copying it to the enclosing transaction record (for X₃) and removing the record for X₄. Then the system sees that X₃ was also committed because it is not in the global list of open transactions, so it promotes the value to the transaction record for X₂ (removing X₃). The system then sees that X₂ is still an open transaction and stops. At this point the value v₄ can now be found in the transaction record for X₂.

The query (3912) is performed after X₀ is committed, so when it is processed the set of open transactions is empty. The implicit promotion logic recognizes that transactions X₂ and X₀ have been committed and modifies the leaf entry to have only one transaction record, marked as the committed value with an XID of zero. An XID of zero denotes the root transaction, and is shown as a “root” XID in both the query (3912) and the leaf entry (3907) of FIG. 39.

Deletes are handled in a manner that is similar to inserts. When a delete message arrives at a leaf entry, the same implicit promotion logic is applied as when an insert arrives. But instead of copying a value into the innermost transaction record, the system sets a “delete” flag.

Furthermore, if the next outer transaction record in the leaf entry is a delete, then the newly arrived delete is not recorded, because no matter whether the transaction for the new delete is committed or aborted there will be no change to the leaf entry. The leaf entry will still be subject to the delete issued by the enclosing transaction, and any query in this transaction (after the delete and before an insert) discovers no value. Alternatively, other approaches could be taken. For example, the system could store transaction records for nested deletes and then remove those records at a later time to facilitate the destruction of the leaf entry.

Also, if after the delete message is applied to the leaf entry the only transaction record is a delete, then the leaf entry is removed from the OMT. Whether the transaction commits or aborts, the leaf entry will not exist, which can be represented by the absence of a leaf entry.

FIG. 40 shows an example of a set of transactions with insertions and deletions (4001). FIG. 41 shows how the operations of FIG. 40 are processed. The delete message (4110) modifies the leaf entry just as if it were an insert message, except that instead of a value in the transaction record there is a delete flag.

The arrival of message (4111) has no effect. It would be logically correct to add a transaction record ⟨X₄, delete⟩ on the top of the stack, but it is not necessary. Instead, the system leaves the leaf entry unchanged, because if X₄ were to commit immediately after the arrival of message (4111), then the leaf entry would look the same as if X₄ were to abort immediately after the arrival of the message. Also, any query issued within X₄ before the insert at Line 13 in FIG. 40 would find no value, with or without a transaction record for X₄.

When message (4112) arrives at the leaf entry, the leaf entry is destroyed. The implicit promotion logic causes the leaf entry to temporarily take on the value ⟨X₀, v₄⟩, but then the transaction record for X₀ is modified to have a value of ⟨X₀, delete⟩. Since at that point the only transaction record is for a delete, the leaf entry can be destroyed.

The arrival of an abort message at a leaf entry is similar to the arrival of an insert or delete, causing the implicit promotion of values set by transactions that have been committed. After performing that implicit promotion, the system removes the transaction record for the aborted transaction in the leaf entry, and then removes any placeholders that are on the top of the transaction record stack.

FIG. 42 shows an example of a set of nested transactions where the innermost transaction aborts (4201). FIG. 43 shows how the operations of FIG. 42 are processed. With the arrival of message (4310), the transaction record for X₄ is removed, leaving the placeholder for X₃ at the top of the stack. Then the placeholder for X₃ is removed, leaving the transaction record for X₂ at the top of the stack. Thus once X₄ is aborted, X₂ is the innermost transaction to have modified the leaf entry.

Alternatively, other variants of this strategy can be implemented. For example, when a transaction is committed, a message could be sent. As another example, if a message is sent whenever a transaction is committed, then the system can query the data without implicitly promoting leaf entries. As another example, the system could send commit messages when it is otherwise idle, in which case the system, when querying, would perform implicit promotion if needed.

Alternatively, the scheme can be adapted to support different isolation levels. For example, to support a read-uncommitted isolation level during a query, the system can return the value at the top of the value stack even if the top of the stack identifies a pending transaction.

Balancing

The system employs a parameter called the maximum degree bound, which is set to 32. If the number of children of a node ever exceeds the maximum degree bound, then the system splits the node in two. If two nodes are adjacent siblings (that is, one is child i of a node and the other is child i+1 of the same node) and the total number of children of the two sibling nodes is less than half the maximum degree bound, then the system merges the two nodes.
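Stated as code, the two thresholds are as follows (a sketch; the constant matches the value given above, and the function names are hypothetical):

```c
enum { MAX_DEGREE_BOUND = 32 };

/* Split when a node's child count exceeds the bound. */
int node_needs_split(int nchildren) {
    return nchildren > MAX_DEGREE_BOUND;
}

/* Merge adjacent siblings when their combined child count is less than
 * half the bound. */
int siblings_should_merge(int nchildren_left, int nchildren_right) {
    return nchildren_left + nchildren_right < MAX_DEGREE_BOUND / 2;
}
```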

Alternatively, the maximum degree bound could be set to some other number, which could be a constant, a function of some system or problem-specific parameters, a function of the history of operations on the data structure, or a function of the sizes of the pivot keys, or chosen according to other criteria. It may also vary within the tree.

When a nonleaf node has c children, numbered from 0 inclusive to c exclusive, the system can split the node in two as follows. The system allocates a new block number using the block translation table (3001). It moves the children numbered from c/2 inclusive to c exclusive to the new node, numbering them from 0 inclusive to c−c/2 exclusive in the new node. When moving a child, the buffer is moved too. The pivot keys, which are numbered from 0 inclusive to c−1 exclusive, are also reorganized. The pivot keys from c/2 inclusive to c−1 exclusive are moved, renumbering them from 0. Pivot key number c/2−1, called the new pivot, is removed from the old node, and is inserted into the pivot keys of the node's parent. If the node is child number i of its parent, then the moved pivot key becomes pivot key number i in the parent, and the higher numbered pivot keys are shifted upward by one. If the node has no parent, then a new parent is created with a single pivot key. In the parent, the block number of the new child is inserted so that the new child is child number i+1 in the parent, and any higher numbered children are shifted up by one.

The buffer in the parent is also split. That is, if the parent existed, then the messages in buffer number i of the parent are removed from that buffer, and are copied into buffers i and i+1 as they would be during message execution in a nonleaf node. That is, each message is examined, and if its key is less than or equal to the new pivot then it is copied into buffer i, and if its key is greater than or equal to the new pivot then it is copied into buffer i+1.

After splitting a node, the node may end up being larger than its target size. In that case, the system flushes some buffers. Alternatively, the system may wait until some future operation to flush some buffers.

After splitting a node, the parent node may have more children than the maximum degree bound. In that case, the system splits the parent. Alternatively, the system may wait until some future operation to split the parent.

When a leaf node exceeds its target size, the system splits the leaf node, creating a new node and moving the greater half of the key-value pairs to the new node. An appropriate pivot key is constructed which distinguishes between the lesser half and the greater half of the key values, the pivot key and new node are inserted into the parent, and the corresponding buffer in the parent is split, just as for the case of splitting a nonleaf node. Similarly, if there is no such parent, then a new parent is created just as when splitting a nonleaf node.

To merge two nonleaf nodes that are adjacent siblings is essentially the opposite of splitting a node. If one node has c₀ children and is child i of its parent, and the other node has c₁ children and is child i+1, then pivot key i in the parent is moved to be pivot key number c₀−1 in the first node, the parent's higher numbered pivot keys shift down, and the pivot keys of the second node become the pivot keys numbered from c₀ inclusive to c₀+c₁−1 exclusive. The child pointers and buffers from the second node are moved to the first node. And in the parent, buffer i and buffer i+1 are merged together by dequeueing each item from buffer i+1 and enqueueing it into buffer i. Buffer i+1 is freed, and the buffer and child pointers are shifted downward.

To merge two leaf nodes that are adjacent siblings is similar. The parent node is changed in the same ways (merging buffers and shifting pivot keys, buffers, and child pointers down). The two children are merged by moving all the leaf entries from child i+1 to child i.

The now-unused child's block number is returned to the free list in the block translation table (3001).

After merging two nodes, the resulting node may be larger than the target size for that node. In that case, the system flushes buffers. Alternatively, the system may flush the buffers at a future time.

Alternatively, there are other ways of splitting and merging nodes. For example, the buffers that are to be split or merged may be flushed before the split or merge actually takes place.

Alternatively, there are other ways of implementing the tree. For example, the fanout and number of pivot keys in each node can be variable, and could depend on the size of the pivot keys. Some fixed amount of space could be dedicated to the pivot keys. For 1 MB blocks, this space could be between 1 KB and 4 KB, unless the pivot keys are larger than 4 KB, in which case there might be only a single pivot key.

Alternatively, it is possible to place a maximum limit on the number of pivot keys, regardless of how small the keys are.

In each node the system keeps a counter of how many successive insertions have inserted at the rightmost edge of the node. When splitting a node, if that counter is more than half the number of leaf entries in the node, then instead of splitting the node in half, the system splits the node unevenly so that few or no leaf entries are moved into the new node. This has the effect of packing the tree more densely when performing sequential insertions.
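The split-point choice can be summarized by a minimal C sketch under these assumptions; the function name and signature are illustrative only, not the system's actual interface.

    // Hypothetical sketch: pick a split point for a node of nentries
    // entries. If more than half of the recent insertions landed at the
    // rightmost edge, split unevenly so the old node stays nearly full,
    // which packs sequential loads densely.
    static int choose_split_point(int nentries, int rightmost_inserts) {
        if (rightmost_inserts > nentries / 2)
            return nentries - 1;   // move few or no entries to the new node
        return nentries / 2;       // default: split in half
    }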

Alternatively, the system can employ other ways of optimizing sequential insertions or other insertion patterns. For example, another way to detect sequential insertions is for the system to keep track of the last key inserted; whenever an insertion is to the immediate right of the last insertion and a node splits, the system splits the node just to the right of the last insertion. Furthermore, the system could keep a counter for each node, or for the whole tree, of how many successive insertions inserted just to the right of the previous insertion, and use that information to decide how to split a node. Similarly, the system could detect and optimize for sequential insertions of successively smaller keys.

Alternatively, when merging nodes, the system could consider the node to the left or to the right of a node, and merge more than two nodes. The decision to merge could be based on a different threshold, including but not limited to the combined size being less than 10% of a node's target size.

Alternatively, the system could adjust the target size of a node based on many factors. For example, the system could keep a time stamp on each node recording the last time that the node was modified. The system could then adjust the target size depending on the time since the node was modified.

Cursors

A cursor identifies a location in a dictionary. One way to identify a location is using the key-value pair stored in that location. A cursor can be set to the position of a given key-value pair, and can be incremented (moved to the next-larger key-value pair) or decremented (moved to the next-smaller key-value pair) in the tree. The system employs cursors to implement other query and update operations in the tree, as well as other functions, such as copying a tree for a backup.

The system implements cursors comprising:

1. A root-to-leaf path in the tree, where the cursor indicates one of the key-value pairs in that leaf. Multiple cursors are allowed to point to a single key-value pair in a given leaf node.

2. Each leaf node stores a set of all cursors pointing to key-value pairs in that leaf.

The root-to-leaf path for a cursor is stored as follows:

⟨⟨root  blocknum, child  number⟩,   ⟨nonleaf-node  blocknum, child  number⟩,   …  ⟨nonleaf-node  blocknum, child  number⟩,   leaf  number⟩

When the tree changes shape (e.g., because of tree-balancing operations), the system updates any affected paths.

A cursor points to leaf nodes that are in RAM. That is, a node containing a key-value pair pointed to by a cursor is pinned in RAM and is not ejected until the cursor is deleted or moves to another node.

The cursor implementation maintains the property that every buffer on the path of a cursor is empty. This means that setting a cursor to point to a given node triggers emptying of the buffers on the cursor's root-to-leaf path.

Each buffer maintains a reference count of the number of cursors that pass through that buffer. When the reference count of a buffer is nonzero, a message is sent directly to the child node of the buffer. When the reference count is zero, a message is stored in the buffer or passed down according to the buffer-management rules outlined above.
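The routing rule can be written as a minimal C sketch. Everything named here (buffer_t, node_t, message_t, apply_to_child, buffer_enqueue) is a hypothetical placeholder for the system's actual identifiers.

    typedef struct { int cursor_refcount; /* ... */ } buffer_t;
    typedef struct node node_t;
    typedef struct message message_t;
    void apply_to_child(node_t *child, message_t *m); // deliver past the buffer
    void buffer_enqueue(buffer_t *buf, message_t *m); // ordinary buffering

    void route_message(buffer_t *buf, node_t *child, message_t *m) {
        if (buf->cursor_refcount > 0)
            apply_to_child(child, m);  // a cursor passes through: keep buffer empty
        else
            buffer_enqueue(buf, m);    // normal buffer-management rules
    }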

FIG. 9 depicts a path-based implementation of cursors. The tree (901) storing all the data has depth four; there are 26 nodes, labeled with all the letters of the alphabet. There is a cursor dictionary (903) with five different cursor paths. The first cursor path (902, 904) is abem. There are four other cursor paths (905) stored in the cursor dictionary (903). Inserting a cursor causes a root-to-leaf path to be flushed, causing all of the elements pointed to by the cursors to be in the leaves.

Alternatively, there are other ways of implementing cursors. For example, rather than storing root-to-leaf paths, one could store the key-value pairs in an in-RAM dictionary. The cursor root-to-leaf paths are then implicit, rather than explicit. This solution lets a node of the streaming dictionary efficiently query whether any cursors travel through it by performing a query on the in-RAM dictionary. All cursor updates involve predecessor and successor searches in the dictionary. This solution also further decouples the paging from the cursors, and can be useful for cursors that operate on non-tree dictionaries.

FIG. 14 depicts an implementation of cursors in another mode of operation of the system. This cursor mode is implemented as key-value pairs in a smaller dictionary. The top part of the tree storing all of the elements is depicted above (1401). There are four cursors indicating key-value pairs k₁ (1402), k₂ (1403), k₃ (1404), and k₄ (1405). The cursors may indicate key-value pairs that are still in message form at nonleaf nodes in the tree. Moreover, the cursors may indicate key-value pairs in nodes that are paged out to disk. In particular, of the nodes indicated, those labeled (1406) holding k₁ and (1408) holding k₃ are paged out. In contrast, those labeled (1407) holding k₂ and (1409) holding k₄ are paged into RAM. Below is a small in-RAM dictionary (1410) storing key-value pairs k₁ (1411), k₂ (1412), k₃ (1413), and k₄ (1414). Only the keys need to be stored in this small dictionary. There are no pointers between the kᵢs in the big streaming dictionary and the small in-RAM dictionary holding the cursors.

In another mode of operation, a cursor is represented using a pointer at an OMT along with an index. The cursor also includes a pointer that points into the memory pool (1001) of the OMT, pointing at the key-value pair that the cursor is currently referencing. All of the buffers are flushed on the root-to-leaf path from the root of the dictionary to the leaf node containing the OMT. The cursor provides a callback function to disassociate the cursor from the OMT. The callback function copies the key-value pair into a newly allocated region of RAM, and causes the cursor to stop referring to the OMT and the memory pool (1001). When the cursor is disassociated, it contains a pointer to an allocated region of RAM containing the key-value pair. If any operation results in a message entering one of the buffers along the path, or if the OMT reorganizes itself in RAM, or if the pointer into the memory pool becomes invalid, or the pointer to the OMT or the index in the OMT become invalid, then the system invokes the disassociation callback function.

To advance the cursor, if the cursor is associated with an OMT, then the index is incremented, and the OMT is used to find the next value. If the cursor is disassociated, then the cursor finds the OMT by searching from the root of the dictionary down to a leaf, using the allocated copy in RAM, and then associates the cursor with the OMT. Whenever the cursor searches down the tree, it flushes the buffers it encounters.

To implement a point query, an associated cursor returns a copy of the key-value pair it points to. A disassociated cursor returns a copy of the allocated key-value pair it contains. When a cursor searches, it operates as follows:

1. Let u denote the root node. First bring u into RAM, if it is not there already.

2. If u is a leaf node, then look up the value in the OMT of the leaf by finding the first value that is greater than or equal to the key-value pair being searched for. If looking for a matching key in a DUP database, then the system must skip any leaf entries that contain provisionally deleted values, and find the first non-deleted value. If that skipping proceeds off the end of the OMT, then the system sets u to the parent, and tries to examine the next child of the parent. If there are no more children, then the system returns to the grandparent, and looks at the next child, and so forth, in this way finding the next leaf node in the left-to-right ordering of the tree.

3. The system identifies the appropriate buffer and child of u where k may reside. To do so, identify the largest pivot key pᵢ in that node less than or equal to key k.

4. If there is no such key, then proceed with the leftmost buffer and child of u. Otherwise, proceed with the buffer immediately after pivot key pᵢ and the child associated with that buffer. Now search in that buffer for a message M(k,z).

5. If there are messages in buffer i, then flush those messages to the next level of the tree. Flushing entails bringing the node for child i into RAM, and moving the messages into the appropriate buffers of the child.

6. The system sets u to child i of u, and then proceeds to step 2.

Thus as a successor (or predecessor) query proceeds along a root-to-leaf search path, the system flushes each buffer that the search path travels through. Once the search path reaches the leaf, the smallest key just larger (or smaller) than k in the leaf is the successor (or predecessor), with appropriate care taken for boundary cases where k is larger or smaller than all other keys in its leaf.

In more detail, when searching for k, the system first searches in the root u₀. The system compares pivot keys, identifying the appropriate buffer and child node u₁ to search next. The system flushes the buffer completely, and then searches in child node u₁. At that node, the system identifies the appropriate buffer and child node u₂ to search next, and so forth down the tree. When the leaf node uₗ is reached, the system inserts a cursor at an element in that node and scans forward and/or backward to return the predecessor and/or successor of k, visiting an adjacent leaf as necessary.
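The descent-with-flushing loop can be summarized by a minimal C sketch under these assumptions; every name here is a hypothetical placeholder for the system's actual identifiers.

    typedef struct node node_t;
    int     node_is_leaf(const node_t *u);
    int     child_for_key(const node_t *u, const void *k); // largest pivot <= k
    void    flush_buffer(node_t *u, int i);  // move buffer i's messages to child i
    node_t *get_child(node_t *u, int i);     // brings the child into RAM

    // Walks from the root to the leaf where k (or its neighbor) resides,
    // flushing every buffer that lies on the search path.
    node_t *search_with_flushing(node_t *root, const void *k) {
        node_t *u = root;
        while (!node_is_leaf(u)) {
            int i = child_for_key(u, k);
            flush_buffer(u, i);              // search-path buffers are left empty
            u = get_child(u, i);
        }
        return u;
    }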

Alternatively, there are other ways of satisfying predecessor and successor queries. For example, here is a way to do so in which buffers are not flushed. In the nonleaf nodes, the system could maintain a dictionary, including but not limited to a PMA (described below). The dictionary could store keys and messages so that successor/predecessor queries can be answered efficiently at each node. In effect, each logical cursor comprises a set of cursors, one at each node on the root-to-leaf path. A successor/predecessor query on the logical cursor comprises checking for a successor/predecessor at each node cursor and returning the appropriate value (which will be the minimum/maximum of the successors/predecessors so computed).

One way to satisfy range queries is by using cursors. To implement a range query between two keys [k₁, k₂], first set a cursor to the key k₁, if it exists, and otherwise to the successor of k₁. Then increment the cursor, returning elements, until an element is found whose key is greater than k₂.
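A minimal C sketch of that loop follows; the cursor interface named here (cursor_set_or_successor, cursor_next, cursor_key, cursor_val, key_cmp) is hypothetical, standing in for whatever cursor operations the system exposes.

    #include <stddef.h>

    typedef struct dict dict_t;
    typedef struct cursor cursor_t;
    cursor_t   *cursor_set_or_successor(dict_t *d, const void *k1);
    int         cursor_next(cursor_t *c);   // returns 0 when the tree is exhausted
    const void *cursor_key(cursor_t *c);
    const void *cursor_val(cursor_t *c);
    int         key_cmp(const void *a, const void *b); // dictionary's compare

    // Emits every key-value pair with k1 <= key <= k2.
    void range_query(dict_t *d, const void *k1, const void *k2,
                     void (*emit)(const void *key, const void *val)) {
        cursor_t *c = cursor_set_or_successor(d, k1);  // k1 or its successor
        if (c == NULL) return;                         // nothing >= k1
        do {
            if (key_cmp(cursor_key(c), k2) > 0) break; // passed k2: done
            emit(cursor_key(c), cursor_val(c));
        } while (cursor_next(c));
    }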

Alternatively, the system can employ any correct implementation of successor/predecessor queries to implement a correct range query implementation. The system could avoid flushing buffers when performing queries, or the system could always flush buffers when performing queries. Avoiding flushing buffers is possible when a query is read-only and does not change the structure of the tree. Alternatively, the system could preemptively flush all buffers affected by a query before answering the query.

Alternatively, range queries could be implemented in other ways. For example, the client could provide a function to apply to every key-value pair in a range, and the system could iterate over the tree and the OMT data structures to apply that function. Some such functions admit a parallel implementation. For example, if the function is to add up all the values in the range, then since addition is associative, it can be performed in a tree-structured parallel computation.

Packed Memory Array Supporting Variable-Size Elements

In some modes of operation, the system can store key-value pairs in another dictionary data structure called a packed-memory array supporting variable-size elements (PMAVSE).

The packed memory array (PMA) data structure is an array that stores unit-size elements in sorted order with gaps between the elements. A gap is an empty element in the array. A run of contiguous empty spaces constitutes a number of gaps equal to the length of the run. Let N denote the number of elements stored in a PMA. The value of N may change over time.

A PMA maintains the following density invariant: in any region of a PMA of size S, there are Θ(S) elements stored in it, and for S greater than some small value, there is at least 1 element stored in the region.

Note: the “big-Omega” notation is used similarly to the big-Oh notation described earlier. We say that f(n) is Ω(g(n)) if g(n) is O(f(n)). The “big-Theta” notation is the intersection of big-Oh and big-Omega: f(n) is Θ(g(n)) exactly when f(n) is O(g(n)) and f(n) is Ω(g(n)).

Alternatively, a PMA could use both upper and lower density thresholds.

To search for a given record x in a PMA, the system uses binary search. The binary search is slightly modified to deal with gaps. In particular, if a probe of a cell in the array indicates that that array position is a gap, then the system scans left or right to find a nonempty array cell. By the density invariant, only a constant number of cells need to be scanned to find a nonempty cell.
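The gap-skipping binary search can be sketched in C as follows. The cell layout (a flag plus an integer key) is an assumption made purely for illustration; the system's actual records are of course more general.

    typedef struct { int occupied; long key; } cell_t;   // hypothetical cell
    static int is_gap(const cell_t *c) { return !c->occupied; }

    // Returns the index of k in the n-cell PMA a, or -1 if absent.
    int pma_search(const cell_t *a, int n, long k) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int probe = mid;
            while (probe <= hi && is_gap(&a[probe]))
                probe++;                    // density invariant: O(1) scan
            if (probe > hi) { hi = mid - 1; continue; }  // gaps to the end
            if (a[probe].key == k) return probe;
            if (a[probe].key < k) lo = probe + 1;
            else hi = mid - 1;              // cells mid..probe-1 are gaps
        }
        return -1;
    }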

Alternatively, there are other ways of searching within the array with gaps. For example, one might use a balanced search tree or any of a variety of search trees optimized for memory performance, including but not limited to a van Emde Boas layout, to index into the array. The leaves of the index could be associated with some cells of the array.

The system rearranges elements in a subarray in an activity called a rebalancing. Given a subarray with elements in it, a rebalancing of the subarray distributes the elements in the subarray as evenly as possible.

To insert a given record y into a PMA, the system first searches for the largest element x in the PMA that is less than y. If there is a gap in the array directly after x, then put y into this gap. If there is no gap immediately after x, then to make room for y, rebalance the elements in a certain subarray enclosing x.

A deletion of a given record x from a PMA proceeds as follows. First search for x and then remove it from the PMA, creating a new gap. Then scan the immediate neighborhood of x. If there are more than a certain number of gaps near x, then rebalance a certain subarray surrounding x.

If the entire PMA contains more than a certain number of elements, then the system allocates a new array of twice the size of the old array, and copies the elements from the old array into the new array, distributing the elements into the array as evenly as possible. The old array is then freed.

Alternatively, the new array could be some other size rather than twice the size of the old array. For example, the new array could be 3/2 the size of the old array.

If the entire PMA contains fewer than a certain number of elements, then the system allocates a new array of half the size of the old array, and copies the elements from the old array into the new array, distributing the elements into the array as evenly as possible. The old array is then freed.

FIG. 44 shows an example of a PMA (4401) containing 16 array positions and seven values (4402) distributed across the array.

FIG. 45 shows the same PMA from FIG. 44 into which an additional key (4502) with value equal to 28 has been inserted. There are now eight keys (4502) stored in the PMA. In order to make room for value 28, a region (4503) of the PMA was rebalanced, and the enclosed keys were distributed evenly.

The system sometimes rebalances so that there are additional gaps near areas that are predicted to have many insertions or few deletions in the future, and places fewer gaps near areas that are predicted to have fewer insertions or more deletions in the future.

The following terminology is used to describe the workings of a PMA in our system.

A subarray of a PMA is called a window. If W is a window then the following definitions apply.

1. Define Capacity(W)=number of array cells in W.

2. Define NumElements(W)=number of filled array cells in W.

3. Define Density(W) = NumElements(W)/Capacity(W).

When the array gets too sparse or too dense, it is either grown or shrunk by a factor of G, where G=2.

A smallest subarray that is involved in a rebalance is called a parcel. That is, an insertion that causes a rebalance must affect at least one parcel. The size of a parcel is P.

The parameter A denotes the size of the entire array. That is, A=Capacity(entire PMA).

The maximum and minimum allowed densities of a PMA are denoted D(A) and d(A), respectively.

The maximum and minimum densities allowed in any parcel are denoted D(P) and d(P), respectively.

Several relationships between parameters are maintained.

D(A) ≥ G²·d(A)   (1)

This inequality says that if the elements are recopied from one array (at density D(A)) into a larger array, then the new larger array has density a factor of at least G larger than d(A) and a factor of at least G smaller than D(A). The same holds true if the elements are recopied from one array (at density d(A)) into a smaller array.

d(P) < d(A) < D(A) < D(P) ≤ 1   (2)

P = Θ(log A)   (3)

Alternatively, these parameters can be set to favor certain operations over others.

A rebalance window has an upper density threshold and a lower density threshold, which together determine the target density range of a given window. The density thresholds are functions of the window size. As the window size increases, the upper density threshold decreases and the lower density threshold increases.

When A is a power of two, the system can calculate density thresholds as follows.

G = 2

P = 2^c

A = 2^(c+h)

where c = Θ(log log A) and h = (lg A) − c, where lg A denotes the log base two of A.

Thus for various values of l the parameters are set as follows:

${D\left( {2^{c}2^{l}} \right)} = {{\left( {{D(A)} - {D(P)}} \right)\left( \frac{l}{h} \right)} + {D(P)}}$${d\left( {2^{c}2^{l}} \right)} = {{\left( {{d(A)} - {d(P)}} \right)\left( \frac{l}{h} \right)} + {d(P)}}$

Consider a PMA having the following basic parameters:

G=2

A=512

P=16

D(P)=1.0

D(A)=0.5

d(A)=0.12

d(P)=0.07

The minimum and maximum density thresholds of subarrays are set as follows:

D(P)=D(2⁴)=1.0

D(2⁵)=0.9

D(2⁶)=0.8

D(2⁷)=0.7

D(2⁸)=0.6

D(A)=D(2⁹)=0.5

d(A)=d(2⁹)=0.12

d(2⁸)=0.11

d(2⁷)=0.1

d(2⁶)=0.09

d(2⁵)=0.08

d(P)=d(2⁴)=0.07

It can be verified that all above properties hold.

For arbitrary values of G>1, the density thresholds of a window of size W are set as follows:

${D(W)} = {{\left( {{D(A)} - {D(P)}} \right)\left( \frac{1{g\left( {W/P} \right)}}{1{g\left( {A/P} \right)}} \right)} + {D(P)}}$${d(W)} = {{\left( {{d(A)} - {d(P)}} \right)\left( \frac{1{g\left( {W/P} \right)}}{1{g\left( {A/P} \right)}} \right)} + {d(P)}}$

A window is said to be within threshold if the density of that window is within the upper and lower density thresholds. Otherwise, it is said to be out of threshold.

An insertion of an element y into a PMA proceeds as follows. First, search for the element x that precedes y in the PMA. Then check whether the density of the entire array is above threshold. If so, recopy all the elements into another array that is larger by a factor of G. Otherwise check whether there is an empty array position directly after element x, and if so, insert y after x. Otherwise, rebalance to make space for y as follows. Choose a window size W to rebalance as follows. Choose a parcel that contains x, and consider the parcel to be a candidate rebalance interval. If that candidate is within threshold, then rebalance, putting y after x during the rebalance. If not, then arbitrarily grow the left and right extents of the candidate until the candidate is within threshold. Then rebalance, putting y after x during the rebalance.
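The insertion procedure can be outlined in C as below. Every helper named here (pma_t, density, upper_threshold_for, parcel_around, grow_window, rebalance_and_place, pma_regrow, and the rest) is a hypothetical placeholder; the sketch only mirrors the control flow described above.

    typedef struct pma pma_t;
    double density(pma_t *p, int lo, int hi);         // NumElements/Capacity
    double upper_threshold_for(pma_t *p, int width);  // D(W) for this width
    int    pma_capacity(pma_t *p);
    int    pma_predecessor(pma_t *p, long y);         // largest x less than y
    int    gap_after(pma_t *p, int x);                // is cell x+1 empty?
    void   place_after(pma_t *p, int x, long y);
    void   parcel_around(pma_t *p, int x, int *lo, int *hi);
    void   grow_window(pma_t *p, int *lo, int *hi);   // extend both extents
    void   rebalance_and_place(pma_t *p, int lo, int hi, int x, long y);
    void   pma_regrow(pma_t *p);                      // recopy, G times larger

    void pma_insert(pma_t *p, long y) {
        int a = pma_capacity(p);
        if (density(p, 0, a) > upper_threshold_for(p, a))
            pma_regrow(p);                    // whole array is above threshold
        int x = pma_predecessor(p, y);
        if (gap_after(p, x)) { place_after(p, x, y); return; }
        int lo, hi;
        parcel_around(p, x, &lo, &hi);        // smallest candidate window
        while (density(p, lo, hi) > upper_threshold_for(p, hi - lo))
            grow_window(p, &lo, &hi);
        rebalance_and_place(p, lo, hi, x, y); // spread evenly, y goes after x
    }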

A deletion of an element x proceeds as follows. First, search for the element x in the PMA and remove it. Then check whether the density of the entire array is below threshold. If so, then recopy all the elements into another array that is smaller by a factor of G. Otherwise choose a parcel that contained x, and call it a candidate rebalance interval. If the candidate is within threshold, then the deletion is finished; otherwise grow the left and right extents of the candidate until the candidate is within threshold. Then rebalance the candidate.

Alternatively, there are many ways to choose candidate rebalance intervals. For example, the candidates could be drawn from a fixed set (e.g., the entire array, the first and second halves, the four quarters, the eight eighths, and so forth). Another example is to choose the rebalance window so that all the elements move in the same direction (e.g., to the right) during the rebalancing.

Alternatively, there are several ways to implement a rebalance in place. One way is to compress all the elements to one end of the rebalance interval and then put them in their final positions. This procedure moves each element twice.

The rebalance can also be implemented so that each element only moves once. The system divides the rebalance window into left regions and right regions. In a left region, the initial position of an element is to the right of its final position, so the element needs to be moved left. A right region is defined analogously. For each left region, move each element directly to its final position starting from the leftmost element. For each right region, move each element directly to its final position starting from the rightmost element.

Now we explain how a PMAVSE operates. A PMAVSE supports elements that can have different sizes and also supports cursor operations and a cursor set.

The PMAVSE comprises the following elements:

1. Cursor set—The set of cursors is stored as an unsorted array.

2. Record array—The record array is a PMA. Each element in this record array comprises two or more pointers and a small amount (also unit size) of auxiliary information. Each element thus has unit size. Each element in the record array represents a key-value pair stored in the PMAVSE. Specifically, the record array stores the following:

   (a) A pointer to the key pᵢ·k and the length of key pᵢ·k. The pointer points to a particular location in another array, called the key array, in which the actual key is stored.

   (b) A pointer to the value pᵢ·v and the length of the value. The pointer points to a particular location in another array, called the value array, in which the actual value is stored.

   (c) A flag indicating that the record has been deleted, but that there still exists one or more cursors pointing to the record. The record remains stored in the PMAVSE until all cursors point elsewhere.

3. Key array—The keys are stored in a PMA-type structure, modified to support different-length keys. The lower-bound density thresholds are set to zero as d(A)=d(P)=0.

4. Value array—The values are stored in a PMA-type structure, modified, as with the key array, to support different-length values. The lower-bound density thresholds are set to zero as d(A)=d(P)=0.

To search in the PMAVSE, perform a binary search on the record array. This binary search involves probes into the key array. To perform the binary search for a given key-value pair, pⱼ, use the record array to find the middle element. Call the middle element pᵢ. Then, use the key array to compare pⱼ·k with pᵢ·k.

To perform an insert of a key pⱼ·k once the predecessor key pᵢ·k has been found, insert the new key into the key array and the new value into the value array. It remains to explain how to perform these new insertions, because the keys and values have variable lengths.

Insertions into the key and value arrays use the same computation, except for the minor differences between storing keys and values. The description here is for the key array. For example, in the system all keys can be divided into bytes, each of which is used as a unit-length chunk.

The system divides the keys into unit-length chunks. Each unit-length chunk is inserted or deleted independently. This representation, where keys are split into independent unit-length chunks, is called here a smeared representation. A rebalance in the smeared representation is called here a smeared rebalance.

Refer to FIG. 5 for an example of a PMAVSE. The first array from the top is the cursor set (501). The second array from the top is the record array (502). The third array from the top is the key array (503). The fourth array from the top is the value array (504). The cursor set stores two cursors (505), which indicate two of the four records (506) in the record array (502). There are four keys (507) stored in the key array: aacab, ab, baaccb, and caa. There are four values (508) stored in the value array: 10001, 0000000, 01, and 11011. The records, keys, and values are stored in sorted order, based on the keys.

The PMA insertion, deletion, and rebalance computations can thus be used. To read keys and to perform functions including but not limited to string comparison on the keys, the system regroups key chunks together, with the gaps removed.

The system can also store different-length keys without splitting the keys into chunks. Instead, each key is stored in a single piece. This representation is called here a squished representation.

The system rebalances the PMA as follows. Find the appropriate rebalance interval. Proceed as in a PMA using the smeared representation—grow a rebalance interval until it is within threshold. Then rebalance the elements in the smeared representation. Then squish the elements, i.e., store the unit-size chunks contiguously, that is, with no gaps in between chunks. This rebalance of the elements in the smeared representation can be performed implicitly or explicitly.

Squish the gaps as follows. If the entire element is contained in the rebalance interval, then squish the smeared key evenly from both sides so that half of the gaps go before the squished element and half go after (up to a roundoff error if there is an odd number of chunks).

Refer to FIG. 6 for an example of how to rebalance in the smeared and squished representations. The top array is a key array before a rebalance, in the squished representation (601). The middle array is the key array after a rebalance in the smeared representation (602). The bottom array is the key array after a rebalance in the squished representation (603). The rebalance interval is indicated in all three arrays (608). There are six keys being rebalanced: AA, BBBBB, CCCC, D, E, and FFFF. Consider element CCCC, which in the unbalanced array (601) is partially smeared (605). In the smeared representation (602), the element CCCC (606) is smeared into four chunks, with gaps between each chunk. In the squished representation (603), the gaps are squeezed out of the element (607), with roughly half before and half after. If only part of an element is in the rebalance interval, then that element is not moved. Thus, all of the gaps are squeezed out on the side contained in the rebalance interval. See element BBBBB in the smeared representation (602). Only three chunks from this element are in the rebalance interval. Consequently, all of the gaps have to be squished out to the right, so that the element contains no gaps.

The smeared rebalance can be performed implicitly, rather than explicitly. An element that is only partially located within the rebalance interval does not move at all. To move an element that is entirely contained in a rebalance interval, place the middle unit-size chunk, or middle two unit-size chunks, in the placement of the smeared representation. Next, place the rest of the chunks so that all the gaps are squeezed out.

The PMA stores a set of cursors. The system stores the cursors unordered in an array.

Whenever an element in the PMAVSE shifts around, all cursors pointing to that element are updated. This update involves a scan through the cursor set every time there is a rebalance. An element is not removed from the PMAVSE while there are one or more cursors pointing to it. Instead, the element remains in the PMA with a flag indicating that it has been deleted. Eventually, when no cursors point to the element, it is actually removed.

Alternatively, there are other data structures for storing cursors. For example, the cursors could be stored in an ordered list where the elements have back pointers to the cursor list. Then each element would contain a list of pointers to the cursors at that element. This representation guarantees that one never has to traverse many cursors to find all cursors that have to be updated on a rebalance.

Alternatively, the cursors could also be stored in any dictionary structure, including but not limited to a sorted linked list, a balanced search tree, a streaming disk-resident dictionary, or a PMA, ordered by the elements that they point to, with no back pointers.

File Header

The system stores each dictionary in a file. At the beginning of the file are two headers, each of which comprises a serialization comprising

1. a literal string “tokudata”,

2. a version number,

3. a number indicating the size of the header, stored in a canonical order (most significant byte first),

4. a checksum,

5. a number used to determine whether the system is storing data in big-endian or little-endian order, or some other order,

6. a number indicating how many checkpoints are stored,

7. an offset in the file at which a block translation table (BTT) is stored,

8. a number indicating the disk block number of the root of the tree,

9. a number encoding the LSN of the operation that most recently modified the tree rooted at that root, and

10. a string that encodes dictionary-specific data (including but not limited to the type of each column in the dictionary).

The root block number along with the BTT provide information for an entire tree. The root block number can be translated using the BTT to a segment. The segment in turn may contain block numbers of children, which are translated by the BTT. Two completely different trees may be referred to by different headers, since the BTTs may map the same block numbers to different segments, and the two trees may share subtrees (or the entire trees may be the same), since their respective BTTs may map the same block number to the same segment.

Alternatively, multiple dictionaries can be stored in one file, or a dictionary can be distributed across multiple files, or several dictionaries can be distributed over a collection of files. For example, for implementations that use multiple files for one or more dictionaries, the block translation table can store a file identifier as well as an offset in each block translation pair of a block translation array (3009).

Alternatively, more than two headers can be employed. For example, to take a snapshot of the system, a copy of the BTT and header can be stored somewhere on disk, including but not limited to in a third header location. The system could maintain an array of headers to manage arbitrarily many snapshots.

Buffer Pool

The system employs a buffer pool which provides a mapping between the in-RAM and on-disk representations of tree nodes. When a node is brought into RAM, it is pinned. When a node is pinned, it is kept in RAM until it is unpinned. Pinning a node is a way of informing the system to keep a node in RAM so that it can be manipulated. A node can have multiple simultaneous pins, since multiple functions or concurrent operations can manipulate a tree node.

To pin a node in RAM, the system first checks whether that node is already in the buffer pool and, if not, brings it into RAM. Then the system updates a reference count saying how many times the node has been pinned. A node can be removed from RAM when the reference count reaches zero.

When a node is transferred from disk into RAM, the size of the in-RAM representation is calculated. Then the system constructs the in-RAM representation of the node.

The buffer pool provides a function getandpin which, given a block number, pins the corresponding node in RAM, bringing it into RAM if it is not already there.

The buffer pool also provides a function maybegetandpin, which pins the node only if it is already in RAM. The system employs maybegetandpin to decide whether to move data from one node to another depending on whether the second node is in RAM.

The system also employs maybegetandpin to control aggressive promotion. In one mode, the system aggressively promotes messages into any in-memory node. In another mode, the system aggressively promotes messages only to dirty in-memory nodes.

When the total size of the nodes in RAM becomes larger than the buffer pool's allocated memory, the system may evict some nodes from RAM. The system can evict the least recently used unpinned node from the buffer pool. To evict a node, the node is deleted from RAM, first writing it to disk if the node is dirty.

A node, block, or region of RAM is defined to be dirty if it has been modified since being read from disk to RAM.

Alternatively, there are other ways to optimize the page-eviction strategy in the buffer pool. The decision of which node to evict can be weighted by one or more factors, for example, the size of the node, the amount of time that the node has been ready to be ejected, the number of times that the node has been pinned, or the frequency of recent use.

A Buffer Pool (4601) is a structure comprising

1. n_in_table (4602), a number indicating how many nodes are stored in the buffer pool;

2. table (4603), an array of pointers to pairs (each array element is called a bucket, and the table itself acts as a hash table);

3. table_size (4604), a number indicating how many buckets are in the table;

4. lru_list (4605), a doubly linked list threaded through the pairs, ordered so that the more recently used pairs are ahead of the less recently used pairs;

5. cachefile_list (4606), a list of pointers to cachefiles;

6. size (4607), a number which is the sum of the in-RAM sizes of the nodes in the buffer pool;

7. size_limit (4608), a number which is the total amount of RAM that the system has allocated for nodes;

8. mutex (4609), a mutual exclusion lock (mutex);

9. workqueue (4610), a work queue; and

10. checkpointing (4611), a Boolean which indicates that a checkpoint is in progress.

A buffer pool pair is a structure comprising

1. a pointer to a cachefile;

2. a block number;

3. a pointer to the in-RAM representation of a node;

4. a number which is the size of the in-RAM representation of the node;

5. a Boolean, dirty, indicating that the node has been modified since it was read from disk;

6. a Boolean, checkpoint_pending, indicating that the node is to be saved to disk as part of a checkpoint before being further modified;

7. a hash of the block number;

8. a pointer, hash_chain, which threads the pairs from the same bucket into a linked list;

9. a pair of pointers, next and prev, which are used to form the doubly linked list ordered by how recently used each node is;

10. a readers-writer lock; and

11. a work list comprised of work items, each item comprising a function and an argument.
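The two structures above can be rendered compactly in C. This sketch is illustrative only: the type names, the pthread-based locks, and the opaque cachefile and work-item types are assumptions, not the system's actual declarations.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <pthread.h>

    struct cachefile;          // opaque here
    struct workitem;           // function plus argument; opaque here

    typedef struct pair {
        struct cachefile *cf;
        uint64_t blocknum;
        void    *node;                 // in-RAM representation of the node
        size_t   size;                 // in-RAM size of the node
        bool     dirty;
        bool     checkpoint_pending;
        uint32_t fullhash;             // hash of the block number
        struct pair *hash_chain;       // threads pairs in the same bucket
        struct pair *next, *prev;      // recency-ordered doubly linked list
        pthread_rwlock_t lock;         // readers-writer lock
        struct workitem *work_list;
    } PAIR;

    typedef struct bufferpool {
        int    n_in_table;             // how many nodes are stored
        PAIR **table;                  // hash table of buckets
        int    table_size;             // number of buckets
        PAIR  *lru_list;               // most recently used pairs first
        struct cachefile *cachefile_list;
        size_t size, size_limit;       // current and allowed total node RAM
        pthread_mutex_t mutex;
        struct workqueue *wq;
        bool   checkpointing;          // a checkpoint is in progress
    } BUFFERPOOL;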

A cachefile is organized so that it is in one-to-one correspondence with the open dictionaries. A cachefile is a structure comprising

1. refcount, a number, called a reference count, which is incremented every time a dictionary is opened, every time a rollback entry is logged, and any time any other use is made of the cachefile, to prevent the cachefile from being closed until all uses of the cachefile have finished;

2. fd, a number which is a file descriptor for the file that holds the on-disk data;

3. filenum, a number which is used to number files in the recovery log;

4. fname, a string which is the name of the file;

5. a pointer to the header for the file;

6. a pointer to the BTT for the dictionary;

7. a pointer to the CBTT for the dictionary; and

8. a pointer to the TBTT for the dictionary.

A work queue is a structure comprising

1. a doubly linked list of work items;

2. a condition variable called wait_read;

3. a number called want_read;

4. a condition variable called wait_write;

5. a number called want_write;

6. a Boolean indicating that the work queue is being closed; and

7. a number which counts the number of work items in the list.

To enqueue a work item onto a work queue, the system performs the following operations:

1. Lock the work queue.

2. Increment the counter.

3. Put the work item into the doubly linked list.

4. If want_read>0 then signal the wait_read condition.

5. Unlock the work queue.

To dequeue a work item from a work queue, the system performs the following operations:

1. Lock the work queue.

2. While the work queue is empty and the Boolean indicates that the queue is not closed:

   (a) Increment want_read.

   (b) Wait on the wait_read condition.

   (c) Decrement want_read.

3. Decrement the counter.

4. Remove a work item from the doubly linked list.

5. Unlock the work queue.
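A minimal C sketch of these two procedures, using a pthread mutex and the wait_read condition variable, follows. The struct layout and the list helpers are hypothetical placeholders assumed for the example.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct workitem { struct workitem *next, *prev; } workitem_t;
    typedef struct {
        workitem_t     *items;      // doubly linked list of work items
        pthread_mutex_t mutex;
        pthread_cond_t  wait_read;
        int             want_read;
        bool            closed;     // the work queue is being closed
        int             count;      // number of work items in the list
    } workqueue_t;
    void        list_push(workitem_t **items, workitem_t *wi);
    workitem_t *list_pop(workitem_t **items);

    void workqueue_enqueue(workqueue_t *wq, workitem_t *wi) {
        pthread_mutex_lock(&wq->mutex);
        wq->count++;
        list_push(&wq->items, wi);
        if (wq->want_read > 0)
            pthread_cond_signal(&wq->wait_read);
        pthread_mutex_unlock(&wq->mutex);
    }

    workitem_t *workqueue_dequeue(workqueue_t *wq) {
        pthread_mutex_lock(&wq->mutex);
        while (wq->items == NULL && !wq->closed) {
            wq->want_read++;
            pthread_cond_wait(&wq->wait_read, &wq->mutex);
            wq->want_read--;
        }
        workitem_t *wi = NULL;
        if (wq->items != NULL) {
            wq->count--;
            wi = list_pop(&wq->items);
        }
        pthread_mutex_unlock(&wq->mutex);
        return wi;   // NULL only if the queue was closed while empty
    }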

In some cases the locking and unlocking steps can be skipped, for example, if the work queue is being filled before any worker threads are initialized.

When a buffer pool is created, a set of worker threads is created. Each thread repeatedly dequeues a work item from the work queue (waiting if there are no such items), and then applies the work item function to the work item. In some cases, the system decides that there is a large backlog of work items, and prevents additional writes into the buffer pool, using the want_write condition variable.

In some cases a thread writes a node to disk directly. In other cases, a thread schedules a node to be written to disk. For example, when reading one node, if the buffer pool becomes oversubscribed, the system schedules the least recently used node to be written to disk by enqueuing a work item. That enqueued work item, when run, obtains a writer lock on the pair, and writes the node to disk.

When a dictionary is open in the buffer pool, a cachefile is associated with the dictionary. When a dictionary is opened, the system either finds the currently associated cachefile (in which case the reference count is incremented) or creates a new cachefile. In the case where a new cachefile is created, the system opens a file descriptor and stores that in the cachefile. The system stores the file name in the cachefile. The system allocates a file number, and logs the association of the file number with the path name. If the file exists, then the header is read in, a header node is created, and the pointer to the header is established. If the file does not previously exist, a new header is created.

When a dictionary is closed, the reference count is decremented. When the reference count reaches zero, the system

1. flushes any pairs that belong to that cachefile, writing them into the file;

2. waits for any pairs in the work queue to complete;

3. writes the header to the file;

4. removes the cachefile from the linked list of cachefiles;

5. closes the file descriptor;

6. deallocates the RAM associated with the cachefile; and

7. performs any additional housekeeping that is needed to close the cachefile.

To perform a getandpin operation on a node, the system computes a hash on the block number, and looks up the node in the hash table. If the node is being written or read by another thread, the system waits for the other thread to complete. If the node is not in the hash table, the system reads the node from disk, decompressing it, and constructs the in-RAM representation of the node. Once the node is in RAM, the system modifies the least-recently-used list, and acquires a reader lock on the pair. If the checkpoint_pending flag is TRUE, then the system

1. writes the node to disk (updating the BTT),

2. also updates the temporary BTT used for the node's dictionary (the temporary BTT is created, for example, during a checkpoint), and

3. sets checkpoint_pending to FALSE

before returning from getandpin.

If the buffer pool hash table ever has more nodes in the buffer pool than there are buckets in the hash table, the system doubles the size of the hash table, and redistributes the values. Each pair p has a hash value h(p) stored in it. If the length of the table is n, then p is stored in bucket h(p) mod n.
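The doubling step can be sketched in C, reusing the hypothetical PAIR and BUFFERPOOL sketches above; since each pair carries its stored hash, re-bucketing needs no hash recomputation.

    #include <stdlib.h>

    void table_grow(BUFFERPOOL *bp) {
        int    n = bp->table_size * 2;
        PAIR **t = calloc(n, sizeof(PAIR *));
        for (int b = 0; b < bp->table_size; b++) {
            PAIR *p = bp->table[b];
            while (p != NULL) {
                PAIR *next = p->hash_chain;
                int bucket = p->fullhash % n;  // h(p) mod the new table length
                p->hash_chain = t[bucket];     // push onto the new bucket chain
                t[bucket] = p;
                p = next;
            }
        }
        free(bp->table);
        bp->table = t;
        bp->table_size = n;
    }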

When storing a node n from cachefile c that was previously not in the buffer pool, a buffer pool pair is created pointing at c and n. The pair is initialized to hold the block number of the node. The dirty bit is initially set to FALSE if the node is being read from disk, and set to TRUE if the node is being created for the first time (not previously on disk). The hash of the block number is stored, and the node is put at the head of the least-recently-used list. Finally, the node is inserted into the appropriate bucket in the hash table.

For each nonleaf node in RAM, the system maintains the hashes of each of the node's children (in childfullhash (2614)), which can help avoid the need to recompute the hash function on the node.

Alternatively, the system could use different buffer-pool constructions. For example, the system could build a buffer pool based on memory mapping (e.g., the mmap( ) library call), or instead of using a hash table, an OMT could be used.

In some modes of operation the system maintains the invariant that if a node is pinned then its parent is pinned. The system maintains this invariant by keeping a count of the number of children of a node that are in RAM, and treating any node with a nonzero count as pinned. The children can maintain a pointer to the in-RAM representation of the parent.

Whenever the tree's shape changes (for example when a node is split), the counters and the parent pointers are updated.

This invariant can be useful when updating the fingerprint and the estimates of the number of data pairs and the number of distinct keys. The estimates are propagated up the tree just before a node is evicted, rather than on every update to the node.

In some modes of operation, the system propagates data upward every time any node is updated, and does not need to maintain the invariant at all times, but only needs to maintain the invariant when a child node is actually being updated.

Data Descriptors

The system employs a byte string called a data descriptor that describes information stored in a dictionary. The descriptor comprises a version number and a byte string. Associated with each dictionary is a descriptor.

The system uses descriptors for at least two purposes.

1. For comparison functions. The system uses the same C-language function to implement comparisons in different dictionaries. The C-language function uses the descriptor associated with a dictionary to compare two key-value pairs from that dictionary. The descriptor includes information about each field in a key. For example, the descriptor could contain information that the first field of a key is a string which should be sorted in ascending order, and the second field is an integer which should be sorted in descending order.

2. For generating derived rows. In one mode, the system maintains at least two dictionaries. One dictionary is a primary dictionary, and a second dictionary is a derived dictionary. For each key-value pair in the primary dictionary, the system automatically generates a key-value pair for the derived dictionary. For example, if the pairs of a primary dictionary comprise a first name, a last name, and a social security number, then in a secondary dictionary the pairs might comprise the social security number followed by the last name.

Thus a descriptor describes the types and sort order for each field in a key-value pair, and, for derived dictionaries, a descriptor further describes which fields from a primary row are used to populate a derived row.
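A minimal C sketch of such a descriptor-driven comparison follows. The descriptor layout and the compare_field helper are hypothetical; the point is only that one function, parameterized by a descriptor, can order keys for every dictionary.

    #include <stddef.h>

    typedef struct { int descending; /* field type, encoding, ... */ } field_desc_t;
    typedef struct { int nfields; field_desc_t *field; } descriptor_t;
    typedef struct { const void *data; size_t len; } bytes_t;

    // Compares one field of a and b, advancing *off_a and *off_b past it.
    int compare_field(const field_desc_t *f, const bytes_t *a, size_t *off_a,
                      const bytes_t *b, size_t *off_b);

    int descriptor_compare(const descriptor_t *d,
                           const bytes_t *a, const bytes_t *b) {
        size_t off_a = 0, off_b = 0;
        for (int f = 0; f < d->nfields; f++) {
            int c = compare_field(&d->field[f], a, &off_a, b, &off_b);
            if (d->field[f].descending)
                c = -c;                  // per-field sort direction
            if (c != 0)
                return c;
        }
        return 0;                        // equal on every field
    }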

The system upgrades descriptors incrementally. The system organizes each dictionary into one or more nodes. Each node contains the version number of the descriptor for rows stored in that node. If the users of the system need to change the descriptor for a dictionary, the old descriptor and the new descriptor are both stored in the header of the dictionary. When a node is read in, if the descriptor version for that node is an old version, then the system calls a user-provided upgrade function to upgrade all the pairs stored in that node.

On-Disk Encoding and Serialization

To write data to disk, the system first converts a node into a serialized representation (an array of bytes), in much the same way that messages are converted into an array of bytes. Then the data is compressed. Then a node header is prepended to the compressed data, and the node header and compressed data are written to disk as a single block.

A node, as written to disk, comprises the following serialized representation:

1. a literal string “tokuleaf” or “tokunode” depending on whether the node is a leaf node or a nonleaf node;

2. a number indicating which file version the node is, which can facilitate changing the encoding of a block in future versions and can facilitate the reading of older versions of the block;

3. the dictionary's descriptor version;

4. nodelsn (411);

5. the compressed length of the compressed subblock that follows;

6. the uncompressed length of the compressed subblock that follows;

7. a compressed subblock, comprising the following information, which is then compressed as a block:

   (a) a target size of the node (which defaults to 4 megabytes),

   (b) isdup (403),

   (c) height (405),

   (d) randfingerprint (406), and

   (e) localfingerprint (407).

For leaf nodes, the statistics (414) can be represented on disk by recalculating all the values as the leaf node is read in from disk. That is, the system can encode a leaf node's statistics using no bits on disk.

After the localfingerprint (407), leaf nodes are additionally serialized by encoding

1. the number of leaf entries in the node, and

2. for each leaf entry, from least to most in sorted order, the serialized leaf entry.

After the localfingerprint (407), nonleaf nodes are additionally serialized by encoding

1. statistics (414), which are encoded by

   (a) a number ndata (3301),

   (b) a number ndata_error_bound (3302),

   (c) a number nkeys (3303),

   (d) a number nkeys_error_bound (3304),

   (e) a number minkey (3305),

   (f) a number maxkey (3306), and

   (g) a number dsize (3307);

2. the subtree fingerprint, which is the sum of the fingerprints of the children;

3. the number of children;

4. for each child,

   (a) the subtreefingerprint (2611) of the child,

   (b) the stored statistics for the child, encoded as for the node statistics (414) in Item 1 above,

   (c) the block number of the child, and

   (d) the FIFO buffer of the child, represented by

      i. the number of entries in the FIFO buffer, and

      ii. for each message in the FIFO buffer, from oldest to newest, the serialized representation of the message; and

5. for each pivot key,

   (a) the key of the pivot key (encoded as a length followed by the bytes of the key), and

   (b) for DUP dictionaries, the value of the pivot key (encoded as a length followed by the bytes of the value).

After the previously encoded information, each node further encodes a checksum for all of the data, including the uncompressed node header and the compressed subblock. This checksum is computed on the subblock before the data is compressed, so that the system can verify the checksum after the data has been read from disk and decompressed. The checksum is stored at the end of the compressed block.

Alternatively, data can be represented on disk in other ways. For example, minkey (3305) can be eliminated from the on-disk representation if the system takes care to make sure that the pivot keys actually represent a value present in the left subtree.

In one mode of operation, the system compresses blocks using a parallel compression computation. In this case, instead of storing the compressed and uncompressed lengths of the subblock, the system divides the subblock into N subsubblocks, and stores the value N. Each subsubblock can be compressed or decompressed independently by a parallel thread. The compressed and uncompressed lengths of the subsubblocks are stored.

Alternatively, the system can choose how much processing time to devote to compression. For example, if the system load is low, the system can use a compression computation that achieves higher compression.

In one mode, the system adaptively increases the target size of nodes depending on the effectiveness of compression. If a block has never been written to disk, the system sets the block target size to 4 megabytes (4 MB). When a block is read in, the system remembers the compressed size. For example, if the block was 3 MB of uncompressed data and required 0.5 MB after compression, then the block was compressed at 6-to-1, and so the system increases the target size from its default (4 MB) by a factor of 6 to 24 MB. When a block is split, both new blocks inherit the compression information from the original block. If later data is inserted that has more entropy, then when the data is written to disk, a new compression factor is computed, and the block will be split at a smaller size in future splits.
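The adaptive rule reduces to scaling the default target by the observed compression ratio; a minimal C sketch, with illustrative names, follows.

    #include <stddef.h>

    #define DEFAULT_TARGET ((size_t)4 << 20)       // 4 MB default target size

    size_t new_target_size(size_t uncompressed, size_t compressed) {
        size_t ratio = uncompressed / compressed;  // e.g. 3 MB / 0.5 MB = 6
        if (ratio < 1)
            ratio = 1;                             // never shrink below default
        return DEFAULT_TARGET * ratio;             // a 6-to-1 ratio gives 24 MB
    }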

Alternatively, the system could use other ways to implement compression, depending on the specifics of the node representations. For example, each leaf entry or message could be compressed individually. Alternatively, the leaf entries or messages could be compressed in subblocks of the node. If the dictionary is used in a database organized as rows and columns, the keys and values may have finer structure (including but not limited to fields that represent columns). In such a case, a system can separate the fields and store like fields together before compressing them.

Alternatively, other representations of tree nodes could be used. For example, the data could be stored in compressed and/or encrypted form on disk. The data can be stored in a different order. The target node size need not be 4 MB or even any particular fixed value. It need not be constant over the entire tree, but could depend on the particular storage device where the node is located, or it could depend on other factors such as the depth of the node within the tree.

Alternatively, there are other ways of building in-RAM representations, permitting fast searches and updates of key-value pairs in nodes and nodes' buffers. For example, instead of using a FIFO queue in each buffer, one could use a hash table or OMT in a buffer, merge messages at nonleaf nodes of the tree, and on lookup sometimes get values directly out of messages stored at nonleaf nodes. Two or more messages could be merged into one message. A packed-memory array could be used instead of a hash table or OMT.

A block translation table is serialized by encoding

1. a number indicating the size of the block translation table;

2. for each block translation pair,

   (a) the disk offset of the block translated by the pair (encoded as −1 for unallocated block numbers), and

   (b) the size of the block (encoded as −1 for unallocated block numbers); and

3. a checksum.

That information is enough to determine all the information needed in the block translation table. For example, the set of free segments comprises those segments which are not allocated to a block.

For each dictionary, the system serializes the following information at the beginning of the file containing the dictionary:

1. The literal string “tokudata”.

2. The layout version, stored in network order.

3. The size of the header, stored in network order.

4. A byte-ordering literal, which is a 64-bit hexadecimal number 0x0102030405060708 which the system uses to determine the byte order for the data on disk, including but not limited to big-endian or little-endian. Many integers are stored in the byte order consistent with the byte-ordering literal.

5. A count of the number of checkpoints in which the dictionary is participating.

6. The target node size for the dictionary, which defaults to 2²², which is 4 mebibytes.

7. The location of the BTT on disk.

8. The size of the BTT.

9. The block number of the root of the dictionary.

10. isdup (403).

11. An “old” layout version, used to maintain the oldest layout version used by any node in the dictionary.

12. A checksum.

File Names and File Operations

The system uses a level of indirection for dictionary file names. Associated with each dictionary are two names, a dname and an iname.

Dnames are the logical names of the dictionaries. Inames are the file names. The system maintains a dictionary called the dname-iname directory as a NODUP dictionary. The directory maps dname to iname, where dname is the key and iname is the value. A dname and an iname both have the syntax of a pathname.

An iname is a pathname relative to the root of a file directory hierarchy, which is the structure, called an environment, containing all the dictionaries of a particular storage system. The iname is the name of a file in a file system. In most situations where a dictionary is renamed, the system does not rename the underlying file, but instead treats inames as immutable. Every iname is unique over the lifetime of the log. This uniqueness is enforced by embedding the XID of the file creation operation in the iname. In one mode, the iname is a 16-digit hex number with a .tokudb suffix. In another mode the name contains a hint to the original user name, for example tablename.columnname.01234567890ABCDE.tokudb, where tablename is the name of the table, columnname is the name of a column being indexed, and 01234567890ABCDE is a hexadecimal representation of the XID.
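Generating the 16-digit hex form of an iname from an XID is a one-line formatting step; a minimal C sketch, with an illustrative function name, follows.

    #include <inttypes.h>
    #include <stdio.h>

    // Renders an XID as a 16-digit hex iname, e.g. XID 0x01234567890ABCDE
    // yields "01234567890ABCDE.tokudb".
    void make_iname(char *out, size_t outlen, uint64_t xid) {
        snprintf(out, outlen, "%016" PRIX64 ".tokudb", xid);
    }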

Most file operations occur within a transaction. The close operation is a non-transactional file operation.

The dname-iname directory uses string comparison for its comparison function, and has no descriptor.

The dname-iname directory is a dictionary. The system applies checkpointing, logging, and recovery to this dictionary. The directory is recovered like any other dictionary.

The system logs an fassociate (4703) entry in the recovery log when it opens the directory.

When performing file operations, the system typically takes one or more locks on the directory. For example, when renaming a file, an exclusive lock on the old dname and the new dname is acquired. The lock is held until the transaction completes.

The recovery log contains dnames for the purposes of debugging and accountability, stored, for example, in comment fields.

On system start up, the system receives three pathnames from a configuration file, command line argument, or other mechanism.

1. envdir, the environment pathname,

2. datadir, the pathname of the filesystem directory where the dictionaries are stored, and

3. logdir, the pathname of the filesystem directory which holds the recovery log files.

All new data dictionaries are created in datadir.

The datadir is relative to the environment envdir, unless it is specified as an absolute pathname.

All inames are created relative to the envdir, inside the datadir. The pathname stored in datadir will be the prefix of the pathname in the iname.

The envdir is relative to the current working directory of the process running the system, unless it is specified as an absolute pathname.

If the system is shut down and then restarted with a new datadir then

1. New dictionaries are created in the new datadir.

2. Old dictionaries, accessed by iname, are still available in their original directories.

3. The implicit envdir is prefixed to the iname. That is, the full pathname is envdir/iname.

4. Inames stored in the log are of the form original_data_dir/original_iname.

5. Inames stored in the iname-dname directory are of the same form.

When the system performs a file operation, except for close, the system creates a child transaction in which to perform the file operation. If the child transaction fails to commit, then the file operation is undone, making the file operations atomic. Every file operation comprises the following steps:

1. Begin a child transaction.

2. Perform the operation.

3. If the operation failed, abort the child transaction.

4. If the operation succeeded, commit the child transaction without fsyncing the log.

There is no child transaction for file-close.
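The following sketch wraps an arbitrary file operation in a child transaction in this way. The Txn class and its begin_child/commit/abort methods are illustrative stand-ins for the system's transaction API, not its actual interface.

```python
class Txn:
    """Illustrative stand-in for the system's transaction handle."""
    def __init__(self, parent=None):
        self.parent = parent
    def begin_child(self):
        return Txn(self)
    def commit(self, fsync_log=True):
        pass  # stub: a real commit writes a committxn entry (and may fsync)
    def abort(self):
        pass  # stub: a real abort runs the rollback actions

def perform_file_operation(parent_txn, operation):
    """Run a file operation inside a child transaction, so that a failed
    operation is undone and a successful one commits without fsync."""
    child = parent_txn.begin_child()           # 1. begin a child transaction
    try:
        result = operation(child)              # 2. perform the operation
    except Exception:
        child.abort()                          # 3. failed: abort the child
        raise
    child.commit(fsync_log=False)              # 4. succeeded: commit, no fsync
    return result

perform_file_operation(Txn(), lambda child: "renamed")
```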

For all the operations described below, the commit actions are performed when the topmost ancestor transaction commits.

Create or Open Dictionary

Opening a dictionary inserts an fopen (4710) entry in the recovery log. There is no fopen entry in the rolltmp log. Creating a dictionary inserts an fcreate (4709) entry in the recovery log, followed by an fopen (4710) entry if the dictionary is to be opened. When recovery is complete, all dictionaries are closed. After recovery, the iname-dname directory is opened before performing new post-recovery operations.

To create or open a file the system performs the following operations:

1. Examine the iname-dname directory to see if the dname exists.

2. Take a lock on the dname in the directory.

3. Take a write lock if the file is being opened in create-exclusive mode.

4. Take a read lock otherwise.

5. Terminate with an error if:

(a) the dname is found and the operation is to create the file in an exclusive mode, or

(b) the dname is not found and the operation is to open an existing file.

6. If creating a file and the dname is not found:

(a) Take a write lock on the dname in the iname-dname directory, if the write lock has not been acquired earlier.

(b) Generate an iname using the XID of the child transaction.

(c) Insert a key-value pair in the iname-dname directory: INSERT (dname, iname).

(d) Log the file creation:

i. Generate an LSN.

ii. Log an fcreate entry (with dname and iname).

iii. fsync the log.

(e) Create the file on disk using the iname.

(f) Make an fcreate entry in the rolltmp log.

7. Log the fopen, without fsync.

8. Open the dictionary.

9. If the file was just created, take a full-range write lock on the new dictionary.

When the system aborts a file-open operation, aborting the transaction implicitly undoes the operations on the directory.

To abort fcreate the system performs the following operations:

1. Delete the iname file.

2. The iname-dname directory will be cleaned up by the abort. It is not necessary to explicitly modify the directory.

3. The dictionary will be closed implicitly by aborting the transaction.

During recovery, in the backward scan for fcreate, the system performs the following operation:

1. Close the file if it is open.

During recovery, in the backward scan for fopen, the system performs the following operation:

1. Close the file if it is open.

During recovery, in the forward scan for fcreate, the system performs the following operations:

1. If the transaction does not exist (because the topmost parent XID is older than the oldest living transaction), do nothing.

2. Else:

(a) Before reaching the begin-checkpoint record for the oldest complete checkpoint:

i. If the file does not exist, then the file has been deleted, so do nothing.

ii. If the file does exist, record the creation in the transaction's rollback log and open the file.

(b) After reaching the begin-checkpoint record (the file creation was after the checkpoint, so the file may not even exist on disk in the event of certain kinds of system failures):

i. Delete the file if it exists.

ii. Create and open the file, recording the creation in the transaction's rollback log.

(c) The iname-dname directory will be recovered on its own.

During recovery, in the forward scan for fopen, the system performs the following operation:

1. Open the dictionary (using the iname for the pathname). If the file is missing, then ignore the fopen and ignore any further references to this file.

Close Dictionary

To close a dictionary the system performs the following operations:

1. Log the close operation.

2. Close the dictionary.

Delete Dictionary

To delete a dictionary the system performs the following operations:

1. Find the relevant entry in the directory and get the iname. This operation takes a write lock on the key/name pair in the directory by passing in a read-modify-write flag called DB_RMW.

2. If the dictionary is open, return an error.

3. Delete the entry from the iname-dname directory.

4. Make an entry in the rolltmp log.

5. Mark the transaction as having performed a delete.

6. Log an entry in the recovery log.

To commit, the system performs the following operations:

1. If this transaction deleted a dictionary, write a committxn (4705) entry to the recovery log and fsync the log.

2. Delete the iname file if it exists.

Aborting requires no additional work. The directory will be cleaned up by the abort. It is not necessary to explicitly modify the directory.

During recovery, in the forward scan, the system performs the following operations:

1. If the transaction does not exist, do nothing.

2. Else create a rolltmp log entry. (The file will be deleted when the transaction is committed.)

Rename Dictionary

To rename a dictionary the system performs the following operations:

1. Record the rename as a comment in the log.

2. If the dictionary is open, return an error.

3. Delete the old entry from the directory. This operation fails if the dname is not in the directory; otherwise it takes a write lock on the entry using the DB_RMW flag.

4. Insert the new entry into the iname-dname directory.

Aborting requires no additional work. The directory will be cleaned up by the abort. It is not necessary to explicitly modify the directory.

SQL Database Operations

When the system is operating as a SQL database, the database tables are mapped to dnames, which are in turn mapped to inames. In a database, a table comprises one or more dictionaries. One of the dictionaries serves as the primary row store, and the others serve as indexes.

The SQL command RENAME TABLE is implemented by the following steps:

1. Begin a transaction.

2. Create a list of dictionaries that make up the table.

3. For each dictionary:

(a) Close the dictionary if open.

(b) Rename the dictionary.

4. Commit the transaction.

5. If dictionaries are expected to be open, open them.

The SQL command DROP TABLE is implemented by the following steps:

1. Begin a transaction.

2. Create a list of dictionaries that make up the table.

3. For each dictionary:

(a) Close the dictionary if open.

(b) Delete the dictionary.

4. Commit the transaction.

The SQL command CREATE TABLE is implemented by the following steps:

1. Begin a transaction.

2. Create a list of dictionaries that make up the table.

3. For each dictionary:

(a) Create the dictionary.

(b) Close the dictionary.

4. Commit the transaction.

5. If dictionaries are expected to be open, open them.

The SQL command DROP INDEX is implemented by the following steps:

1. Begin a transaction.

2. Delete the dictionary corresponding to the index.

3. Commit the transaction.

The SQL command ADD INDEX is implemented by the following steps:

1. Begin a transaction.

2. Create a dictionary for the index.

3. Populate the dictionary with index key-value pairs.

4. Close the dictionary.

5. Commit or abort the transaction.

6. If successful, open the new dictionary.

The SQL command TRUNCATE TABLE, when there is no parent transaction, is implemented by the following steps:

1. Begin a transaction.

2. Acquire metadata (including dname, settings, and descriptor).

3. For each dictionary in the table:

(a) Close the dictionary if open.

(b) Delete the dictionary.

(c) Create a new dictionary with the same metadata.

(d) Close the dictionary.

4. On success, commit the transaction; otherwise abort the transaction.

5. If dictionaries are expected to be open, open them.

Logging and Recovery

The log comprises a sequence of log entries stored on disk. The system appends log entries to the log as the system operates. The log is implemented using a collection of log files. The log files each contain up to 100 megabytes of logged data. As the system operates, it appends information to a log file. When the log file becomes 100 megabytes in size, the system creates a new log file, and starts appending information to the new file. After a period of operation, the system may have created many log files. Some of the older log files are deleted, under certain conditions described below. Some of the log files may be stored on different disk drives, and some may be backed up to tape. The system thus divides the log into small log files, naming each small log file in a way that will make it possible to identify the logs during recovery, and manages the log files during normal operation and recovery. The large abstract log can also be implemented by writing directly to the disk drive without using files from a file system. In this description, we often refer to a single log, with the understanding that the log may be distributed across several files or disks. The log data could be stored on the same disk drive or storage device as the other disk-resident data, or on different disks or storage devices. We distinguish the log file from the other disk-resident data by referring to the log separately from the disk. In some cases, log entries are stored in the same files that contain the other data.

The log is a sequence of log entries. A log entry is a sequence of fields. The first field is a single byte called the entry type. The remaining fields depend on the entry type. Every log entry begins and ends with the length, a 64-bit integer field which indicates the length, in bytes, of the log entry. The system can traverse the log in the forward or reverse direction by using the length, since the length field at the end makes it easy, given a log entry, to find the beginning of the previous log entry.

Every log entry further includes a checksum which the system examines when reading the log entry to verify that the log entry has not been corrupted.
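As an illustration of this framing, the sketch below frames entries as [length][LSN][entry type][payload][checksum][length] and walks the log backward using the trailing length. The exact field order and the CRC-32 checksum are assumptions; only the leading and trailing length fields and the per-entry checksum are taken from the text.

```python
import struct
import zlib

def frame_entry(lsn: int, entry_type: int, payload: bytes) -> bytes:
    length = 8 + 8 + 1 + len(payload) + 4 + 8  # total bytes, both lengths included
    body = struct.pack(">QQB", length, lsn, entry_type) + payload
    body += struct.pack(">I", zlib.crc32(body))        # per-entry checksum
    return body + struct.pack(">Q", length)            # trailing length

def scan_backward(log: bytes):
    """Yield (lsn, type, payload) from the end of the log to the start,
    using the trailing length to find each entry's beginning."""
    end = len(log)
    while end > 0:
        (length,) = struct.unpack_from(">Q", log, end - 8)
        start = end - length
        lsn, etype = struct.unpack_from(">QB", log, start + 8)
        payload = log[start + 17 : end - 12]
        (stored,) = struct.unpack_from(">I", log, end - 12)
        assert stored == zlib.crc32(log[start : end - 12]), "corrupt entry"
        yield lsn, etype, payload
        end = start

log = frame_entry(1, 7, b"begintxn") + frame_entry(2, 5, b"committxn")
assert [e[0] for e in scan_backward(log)] == [2, 1]
```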

The system defines the following log entry types, which are serialized using similar techniques as for encoding messages. Every log entry begins with an LSN (4722), then includes an entrytype (4723). The system implements the log entries depicted in FIGS. 47-51. In those figures, the length information at the beginning and end of each log entry is not shown, nor is the checksum shown.

1. The system logs a checkpoint_begin (4701) when a checkpoint begins. It includes a timestamp (4724) field which records the time that the checkpoint began.

2. The system logs a checkpoint_end (4702) when it completes a checkpoint. This log type comprises lsn_of_begin (4725), which is the LSN of the checkpoint_begin (4701) entry that was recorded when the checkpoint began, and a timestamp (4724), which records the time that the checkpoint ended. We say that the previous checkpoint_begin (4701) entry corresponds to the checkpoint_end (4702) entry.

3. The system logs an fassociate (4703) when it opens a file. Also, when the system performs a checkpoint, the system records an fassociate (4703) for every open file. This log entry comprises a file number filenum (4726) and a file name filename (4729). The system uses the filenum (4726) in other log entries that refer to a file. This log entry further comprises an integer flags (4727) to record information about the file, for example whether the dictionary contained in the file allows duplicate keys.

4. The system logs a txnisopen (4704) when a checkpoint starts, for each open transaction. This log entry type records the fact that a particular transaction, identified by transaction_id (2812), is open. This log entry comprises transaction_id (2812), which is the same as the LSN (4722) of the begintxn (4707) log entry that was logged when the transaction was opened. This log entry further comprises another XID, parenttxn (4728), which is the XID of the transaction's parent if the transaction has a parent in a nested transaction hierarchy. If the transaction has no parent then a special NULL XID is logged in the parenttxn (4728) field.

5. The system logs a committxn (4705) when it commits a transaction. This log entry comprises transaction_id (2812), which identifies the transaction that is being committed.

6. The system logs an aborttxn (4706) when it aborts a transaction. This log entry comprises transaction_id (2812), which identifies the transaction that is being aborted.

7. The system logs a begintxn (4707) when it begins a transaction. The transaction can thereafter be identified by the LSN (4722) value that was logged. This log entry comprises the XID parenttxn (4728) of the parent of the transaction, if the transaction is a child in a nested transaction hierarchy. If there is no parent, then a special NULL XID is logged in the parenttxn (4728) field.

8. The system logs an fdelete (4708) when it deletes a file. This log entry comprises transaction_id (2812), which indicates which transaction is performing the deletion. This log entry further comprises a file name filename (4729) indicating which file to delete. If the transaction eventually commits, then this deletion will take effect; otherwise this deletion will not take effect.

9. The system logs an fcreate (4709) when it creates a file. This log entry comprises a file name filename (4729), which is the name that the file will be known as when it is operated on in the future; an iname, iname (4730), which is the name of the underlying file in the file system; an integer mode mode (4736), which indicates the permissions that the file is created with (including, but not limited to, whether the file's owner can read or write the file and whether other users can read or write the file); an integer flags flags (4727); an integer descriptor_version (4737); and a byte string descriptor (4738).

10. The system logs an fopen (4710) when it opens a file. This log entry comprises a file number filenum (4726), which is used when referring to the file in other log entries; an integer flags (4727) to record information about the file, for example whether the dictionary contained in the file allows duplicate keys; and a file name filename (4729), which names the file being opened.

11. The system logs an fclose (4711) when it closes a file. This log entry comprises filenum (4726), flags (4727), and filename (4729), similarly to the log entry for fopen (4710). When traversing the log backwards during recovery the system uses the flags (4727) and filename (4729) to open the file.

12. The system logs an emptytablelock (4712) when it locks a table for a transaction, in the case where the table was created by the transaction, or the table was empty when the transaction began. This log entry comprises a transaction_id (2812) and a file number filenum (4726).

13. The system logs a pushinsert (4713) when it inserts a key-value pair into a dictionary and, if there is a previous matching key-value pair, the new key-value pair is to overwrite the old one. This record comprises a file number filenum (4726) indicating the dictionary into which the pair is being inserted, transaction_id (2812) indicating the transaction that is inserting the pair, and the pair comprising key (2810) and value (2811).

14. The system logs a pushinsertnooverwrite (4714) when it inserts a key-value pair into a dictionary when, if there is a previous matching key-value pair, the new pair should not replace the old one. The fields are similar to those of pushinsert (4713).

15. The system logs a pushdeleteboth (4715) when it deletes a key-value pair from a dictionary, where the system is deleting any key-value pair that matches both the key and the value. If no pairs match, then the deletion has no effect. This log entry comprises a filenum (4726), transaction_id (2812), key (2810), and value (2811).

16. The system logs a pushdeleteany (4716) when it deletes a key-value pair from a dictionary, where the system is deleting any key-value pair that matches the key. For dictionaries with duplicates, this can result in deleting several pairs if several pairs match. If there are no such pairs, then the deletion has no effect. This log entry comprises a filenum (4726), transaction_id (2812), and key (2810).

17. The system logs a pushinsertmultiple (4717) when it inserts key-value pairs into one or more dictionaries, where there is a master key-value pair that can be used to compute the key-value pair to be inserted into each corresponding dictionary. For example, if one dictionary is indexed by first-name and then last-name, and another dictionary is indexed by last-name and then first-name, then the master record might contain both names, and the pairs to be inserted into the respective dictionaries can be derived from the master record. The system uses a descriptor, descriptor (4738), to encode how the derived pairs are computed. This log entry comprises a file number, filenum (4726), which identifies a master dictionary, and a sequence of file numbers, filenums (4731), which respectively identify a sequence of derived dictionaries. This log entry further comprises an XID, transaction_id (2812), and the master key-value pair comprising key (2810) and value (2811).

18. The system logs a pushdeletemultiple (4718) when it deletes key-value pairs from one or more dictionaries, in a situation similar to that used by pushinsertmultiple (4717). Deletion from several dictionaries can be specified with a single master key-value pair. This log entry comprises fields filenum (4726), transaction_id (2812), key (2810), value (2811), and filenums (4731).

19. The system logs a comment (4719) when the system writes a byte string to the log, for example to note that the system rebooted at a particular time. Typically the byte string has meaning for the humans who maintain the system, but that is not required. The system also records this type of log entry to align the log end (for example, to a 4096-byte boundary), choosing the comment length to force the desired alignment. This log entry comprises a time stamp timestamp (4724) and a comment comment (4732), which is a byte string.

20. The system logs a load (4720) in some situations when the system performs a bulk load from a data file, including but not limited to files in which rows comprise comma-separated values. In these situations, the system constructs a new dictionary file, and then replaces an old dictionary file with the new one. The system starts with the old dictionary file, and it constructs a new dictionary file without modifying the old one. If the transaction enclosing the bulk load commits, the old file is deleted. If the transaction does not commit, then the system deletes the new file. As part of the load, the system inserts a modified record into the iname-dname dictionary, which is committed or aborted similarly to any other dictionary insertion. Thus, when a transaction commits, the iname-dname dictionary refers to the new file, and the old file is deleted; and when a transaction aborts, the iname-dname dictionary refers to the old file, and the new file is deleted.

The load (4720) log entry comprises a timestamp (4724), which notes the time at which the load was performed; a filenum (4726), which notes which dictionary is being updated; transaction_id (2812); and two file names, oldfname (4733) and newfname (4734), which specify the old and the new file names respectively.

The system also records other log entries at certain times, for example logging dictionary headers or writing an entire dictionary node into the log.

Alternatively, other encodings of the log can be used. For example, the length field could be omitted, since in principle one could scan the log from beginning to end to find the beginning of every log entry. Alternatively, the length of a log entry may be computable from other information found in the log.

The log data is compressed when written to the log. The compression is performed on one or more log entries together. The system assembles an in-RAM array of a sequence of log entries, then compresses them into a block. The compressed block is written to disk as

1. the length of the compressed block,

2. the length of the uncompressed data in the block,

3. the LSN of the first entry in the block,

4. the LSN of the last entry in the block,

5. a Boolean indicating whether there is a checkpoint_begin (4701) log entry in the block,

6. a Boolean indicating whether there is a checkpoint_end (4702) log entry in the block,

7. a Boolean indicating that the compression table was reset when compressing the block,

8. the compressed bytes, and

9. the compressed length again.

The log file itself further comprises a header that indicates that the file is a log file, which may help the system avoid treating a log as though it were a data file (similarly, the data files also have a header which may help prevent such confusion). The compressed length at the end of the block can help the system read log files backward, by starting at the end of a log file, reading the compressed length, and then skipping back to the beginning of the block. The system employs the Booleans that indicate whether there are checkpoint records in the block to find checkpoint records during recovery without examining or uncompressing blocks that have no checkpoint record.

The system uses a compression library that constructs a table as it compresses data. The table initially starts out empty, and as more data is compressed, the table grows. The system, when compressing several blocks to the log, does not always reset the table between compressing blocks. The table-reset Boolean indicates whether the system started with a new table when compressing a block, or whether it used the previously accumulated table. The first compressed block in a file has the table-reset Boolean set to TRUE.
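A sketch of the compressed-block layout follows, with the nine fields in the order listed above. The 64-bit field widths and the use of zlib are illustrative assumptions; note that plain zlib.compress always starts with a fresh table, so the table-reset Boolean here is recorded but not wired into a streaming compressor.

```python
import struct
import zlib

HEADER_FMT = ">QQQQ???"  # fields 1-7; field 8 is the compressed bytes

def write_block(entries: bytes, first_lsn: int, last_lsn: int,
                has_ckpt_begin: bool, has_ckpt_end: bool,
                table_reset: bool) -> bytes:
    compressed = zlib.compress(entries)
    header = struct.pack(
        HEADER_FMT,
        len(compressed),        # 1. length of the compressed block
        len(entries),           # 2. length of the uncompressed data
        first_lsn,              # 3. LSN of the first entry
        last_lsn,               # 4. LSN of the last entry
        has_ckpt_begin,         # 5. checkpoint_begin in the block?
        has_ckpt_end,           # 6. checkpoint_end in the block?
        table_reset,            # 7. compression table reset for this block?
    )
    trailer = struct.pack(">Q", len(compressed))  # 9. compressed length again
    return header + compressed + trailer

def last_block_start(log: bytes) -> int:
    """Use the trailing compressed length to skip back to the block start."""
    (clen,) = struct.unpack_from(">Q", log, len(log) - 8)
    return len(log) - 8 - clen - struct.calcsize(HEADER_FMT)

blk = write_block(b"some log entries", 1, 2, False, False, True)
assert last_block_start(blk) == 0
```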

To decompress a compressed block of log entries, the system starts at the compressed block and checks to see if the table-reset Boolean is TRUE. If not, the system skips backward until it finds a compressed block that has a TRUE table-reset Boolean. Then it scans forward, decompressing each block until it has decompressed the desired block. The system maintains a cache of recently decompressed blocks.

Certain operations, including but not limited to committing a transaction that has no parent, comprise logging entries into the log and then synchronizing the log to disk using the fsync system call. The system implements such operations by writing the log entries to an in-RAM data structure, possibly appending them to some previous log entries, compressing the block, writing the compressed block to disk, and then calling fsync. In some conditions, the system resets the compression table, and in some conditions it does not. For example, if the compressed block ends up at the beginning of a log file, the system resets the table. If more than one million bytes of data have been compressed since the table was reset, the system resets the table.

If the in-RAM data structure exceeds a certain size, the system compresses the data and writes it to the log file as a block. Depending on the situation, the system may or may not perform an fsync or a compression table reset.

The system maintains a count of how much compressed data has been written to a log file. After a fixed number of compressed bytes have been written, the system resets the compression table at the next time that a block is compressed.

The system maintains two in-RAM log buffers. At any given time, one of the log buffers is available to write log entries into. The other log buffer can be idle or busy. When a thread creates a log entry, it appends the log entry to the available log buffer. To write or synchronize the log to disk, a thread waits until the other log buffer is idle. At that point, there may be several threads waiting on the newly idle buffer. One of the threads atomically

1. sets the available buffer to be busy,

2. sets the idle buffer to be available, and

3. resets the newly available buffer so that it is empty,

and then proceeds to compress the busy buffer, write it to disk, and call fsync if necessary. The other threads that were waiting for that buffer to become idle all wait until the fsync has completed, at which point, their log entries having been written to disk, they continue.
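A simplified sketch of this two-buffer scheme follows. The locking details are illustrative assumptions, and the sketch is deliberately simplified: a caller whose entries were already flushed by a concurrent sync may perform one extra, harmless write.

```python
import threading

class DoubleBufferedLog:
    def __init__(self, write_fn):
        self.write_fn = write_fn        # compresses, writes, and fsyncs bytes
        self.available = []             # buffer threads append entries into
        self.other_busy = False         # True while the other buffer is written
        self.lock = threading.Lock()
        self.idle = threading.Condition(self.lock)

    def append(self, entry: bytes):
        with self.lock:
            self.available.append(entry)

    def sync(self):
        with self.idle:
            while self.other_busy:      # wait until the other buffer is idle
                self.idle.wait()
            busy, self.available = self.available, []   # atomic buffer swap
            self.other_busy = True
        self.write_fn(b"".join(busy))   # compress/write/fsync outside the lock
        with self.idle:
            self.other_busy = False
            self.idle.notify_all()      # waiters' entries are now durable

log = DoubleBufferedLog(write_fn=lambda data: None)
log.append(b"begintxn")
log.sync()
```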

In some cases the available log buffer becomes so full that the system forces threads to wait before appending their log entries to it.

In some conditions the system commits several transactions with a single call to fsync.

When the system performs a checkpoint, the system, for each dictionary,

1. saves all the dirty blocks of the dictionary to disk, not overwriting blocks saved at the last checkpoint,

2. records their locations on the disk in a new BTT,

3. saves the new BTT on disk, not overwriting the BTT saved at the last checkpoint,

4. saves a new header that points to the new BTT, not overwriting the header saved at the last checkpoint, and

5. writes other relevant information in the log.

One thread can perform a checkpoint even when other threads are running concurrently by performing the following steps:

1. Write a checkpoint_begin (4701) record.

2. Obtain a lock on the buffer pool.

3. For each pair in the buffer pool, if the pair is dirty, then set its checkpoint_pending Boolean to TRUE and add the pair to a list of pending pairs; otherwise set its pending flag to FALSE.

4. For each open dictionary,

(a) copy the dictionary's BTT to a temporary BTT (the TBTT),

(b) copy the dictionary's header to a temporary header, and

(c) log the association of the file to its file number using an fassociate (4703) log entry.

5. For each transaction that is currently open and has no parent, log the fact that the transaction is open using a txnisopen (4704) log entry.

6. Release the lock on the buffer pool.

7. Establish a work queue.

8. For each pair in the list of pending pairs:

(a) Wait until the work queue is not overfull.

(b) Obtain the lock on the buffer pool.

(c) If the pair's checkpoint_pending is TRUE, then schedule the node to be written to disk by putting the node into the work queue. (The checkpoint_pending could be FALSE because, for example, another thread could have performed a getandpin operation, which would have caused the pending pair to be processed at that time.) The system updates the TBTT as well as the BTT when writing a node to disk.

(d) Release the lock on the buffer pool.

9. Wait for all the writes to complete.

10. For each open dictionary,

(a) Allocate a segment for the dictionary's TBTT, and write it to disk.

(b) Set the temporary header's BTT to point at the newly allocated TBTT, and write the temporary header to disk.

11. Synchronize the disk-resident data to disk using the fsync function.

12. Write the checkpoint_end (4702) to the log.

13. Synchronize the log to disk using the fsync function.

The system frees segments when they are no longer in use. A segment is given to the dictionary's segment allocator (3201) for deallocation when the segment is not used in the BTT, the CBTT, or in a TBTT, and when the segment is not used to hold the on-disk representation of the header, the checkpointed header, or the temporary header. The system can determine that a segment is no longer in use when it writes a block, as follows:

1. When writing a block number for the BTT, the system allocates a new segment.

2. If the old segment is not used for the same block number in either the CBTT or a TBTT, then the segment can be added to a list of segments to deallocate.

If the system is writing a block for a checkpoint, then it updates both the TBTT and the BTT. In this case the old segments identified in both the TBTT and the BTT can each be added to the list of segments to deallocate if each respective old segment is not used in the CBTT.

When a checkpoint completes, the TBTT becomes the CBTT, and the segments in the old CBTT are candidates for deallocation. The system, for each translated block number, examines the old CBTT, the TBTT, and the BTT to see if the corresponding segment is no longer in use. If so, then it adds that segment to the list of segments to deallocate.
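The liveness test can be sketched as follows, modeling each translation table as a mapping from block number to segment; the representation is an assumption made for illustration.

```python
def segment_is_free(segment, btt, cbtt, tbtts, headers):
    """A segment may be freed only when no translation table (BTT, CBTT,
    or any TBTT) and no header copy still references it.
    btt/cbtt: dicts mapping block number -> segment; tbtts: list of such
    dicts; headers: set of segments used by the header, the checkpointed
    header, and the temporary header."""
    in_tables = any(segment in t.values() for t in [btt, cbtt, *tbtts])
    return not in_tables and segment not in headers

btt  = {0: "seg_a", 1: "seg_b"}
cbtt = {0: "seg_a", 1: "seg_c"}
assert segment_is_free("seg_d", btt, cbtt, tbtts=[], headers=set())
assert not segment_is_free("seg_c", btt, cbtt, tbtts=[], headers=set())
```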

Alternatively, the system could write a node to the log when the node is modified for the first time after a checkpoint. If the underlying data files are copied to a backup system, and then the log files are copied to a backup system, the system could use those copied files to restore the dictionaries to a consistent state.

The system maintains two copies of the dictionary header and two copies of the block translation table. The system maintains the two copies in such a way that they are distant from each other on disk or on separate disks. The system maintains the LSN on each header as well as a checksum on each header.

In a quiescent state, the system has written both copies of the header with the same LSN, the same data, and correct checksums. When updating the header on disk, the system first checks to see if there are two good headers that have the same LSN (that is, whether the system is in a quiescent state). If they both exist, then the system

1. overwrites one header,

2. synchronizes the disk with the fsync( ) system call, and then

3. overwrites the other header.

If two good headers exist but they have different LSNs, then the system

1. overwrites the older header,

2. synchronizes the disk, and then

3. overwrites the newer header.

If only one header is good, then the system

1. overwrites the bad header,

2. synchronizes the disk, and then

3. overwrites the other header.

This sequence of steps is called a careful header write.
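A minimal sketch of the careful header write follows, modeling the two on-disk copies as a two-element list and injecting the sync step so the ordering logic stands alone; the field names are illustrative assumptions.

```python
def careful_header_write(slots, new_header, sync):
    """slots: two dicts with 'lsn' and 'good' flags describing the on-disk
    copies. Writes the new header so that at every instant at least one
    good header remains on disk."""
    h0, h1 = slots
    if h0["good"] and h1["good"] and h0["lsn"] == h1["lsn"]:
        first = 0                                    # quiescent: either order
    elif h0["good"] and h1["good"]:
        first = 0 if h0["lsn"] < h1["lsn"] else 1    # overwrite the older copy
    else:
        first = 0 if not h0["good"] else 1           # overwrite the bad copy
    slots[first] = dict(new_header, good=True)
    sync()                            # fsync before touching the second copy
    slots[1 - first] = dict(new_header, good=True)

disk = [{"lsn": 10, "good": True}, {"lsn": 9, "good": True}]
careful_header_write(disk, {"lsn": 11}, sync=lambda: None)
assert disk[0]["lsn"] == disk[1]["lsn"] == 11
```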

When opening a dictionary for access, the system reads the two headers, selecting the good one if there is only one good header, and selecting the newer one if there are two good headers. If neither header is good then the system performs disaster recovery, obtaining a previously backed-up copy of the database and reapplying any operations that have been logged in a logging file.

Thus, the system has the option of selecting a header from the log, or can retrieve a header from one of the two copies stored on disk.

Alternatively, the details of the disk synchronization and writes can be changed. For example, in some situations it suffices to perform a careful header write and not write a copy of the header to the log. In some situations it suffices to write the header to the log and not maintain two copies of the header on disk. Another alternative is to write segments to the log device instead of to the disk, so that the snapshot is distributed through the log. Another alternative is to take a “fuzzy snapshot” in which the segments are saved to disk at different times, and enough information is stored in the log to bring the segments into a consistent state.

To start the system after a crash, the system reads the log backwards to find the most recent checkpoint_end (4702) log entry. That log entry includes the LSN of the checkpoint_begin (4701) entry that was written at the beginning of the checkpoint. When a header is being read from a dictionary, if there are two good headers, the system chooses the header that has the LSN matching the beginning of the checkpoint.

When recovering from a crash, the system maintains a state variable, illustrated in FIG. 52. The state variable is changed as log entries are processed during recovery. In FIG. 52 the state variable is shown changing as an arc with a label.

When recovering from a crash, the system performs the following operations:

1. Acquire a file lock (for example, using the flock( ) system call on Linux and FreeBSD, an fcntl on Solaris, and _sopen with locking arguments on Windows).

2. Delete all the rolltmp files.

3. Determine whether recovery is needed. If there are no log files, or there is a “clean” checkpoint (one that had no open transactions while running) at the end of the log file, then recovery is not needed.

4. Create an environment for recovery (creating a buffer pool, and initializing the default row comparison and row generation functions).

5. Write a message to the error log indicating the time that recovery began.

6. Find the last log entry in the log. The system skips empty log files during recovery, and if there is a partial log entry at the end of the last log file, the system skips that. There are many reasons why a log file might be empty or a log entry might be incomplete, including but not limited to the disk having been full when the log entry was being written.

7. Scan backward from the last log entry. For each log entry encountered, do the following operation depending on the log entry:

(a) checkpoint_begin (4701):

i. If the system is in the BBCBE (5202) state, then if there were no live transactions recorded, go to the FOCB (5204) state and start scanning forward; otherwise go to the BOCB (5203) state. The system prints an error log message indicating that recovery is scanning forward.

ii. Otherwise continue.

(b) checkpoint_end (4702): If the system is in the BNCE (5201) state, then go to the BBCBE (5202) state and record the XID of the checkpoint (that is, the LSN of the corresponding checkpoint_begin (4701)).

(c) fassociate (4703): If the system is in the BBCBE (5202) state then open the file.

(d) txnisopen (4704): If the system is in the BBCBE (5202) state then increment the number of live transactions, and if the XID is less than any previously seen one (or if there is no previously seen one) then remember the XID.

(e) committxn (4705): Continue.

(f) aborttxn (4706): Continue.

(g) begintxn (4707): If the system is in the BOCB (5203) state and the XID of this log entry is equal to the oldest transaction mentioned in a txnisopen (4704) log entry in the BBCBE (5202) state, then go to the FOCB (5204) state and start scanning forward.

(h) fdelete (4708): Continue.

(i) fcreate (4709): Close the file if it is open.

(j) fopen (4710): Close the file if it is open.

(k) fclose (4711): Continue.

(l) emptytablelock (4712): Continue.

(m) pushinsert (4713): Continue.

(n) pushinsertnooverwrite (4714): Continue.

(o) pushdeleteboth (4715): Continue.

(p) pushdeleteany (4716): Continue.

(q) pushinsertmultiple (4717): Continue.

(r) pushdeletemultiple (4718): Continue.

(s) comment (4719): Continue.

(t) load (4720): Continue.

(u) txndict (4721): Merge the log entries from the identified dictionary into the recovery logs, and process them.

8. Scan forward from the point identified above. For each log entry encountered, do the following operation depending on the log entry:

(a) checkpoint_begin (4701): If the system is in the FOCB (5204) state then go to the FBCBE (5205) state.

(b) checkpoint_end (4702): If the system is in the FBCBE (5205) state then go to the FNCE (5206) state.

(c) fassociate (4703): Continue.

(d) txnisopen (4704): Continue.

(e) committxn (4705): If the transaction is open, then execute the commit actions for the transaction, and destroy the transaction.

(f) aborttxn (4706): If the transaction is open, then execute the abort actions for the transaction, and destroy the transaction.

(g) begintxn (4707): Create a transaction.

(h) fdelete (4708): If the file exists and the identified transaction is active, then create a commit action that will delete the file when the transaction commits.

(i) fcreate (4709): If the system is not in the FOCB (5204) state, then unlink the underlying file from the file system (if the file exists) and create a new one, updating the iname-dname dictionary.

(j) fopen (4710): Open the file.

(k) fclose (4711): If the file is open, then close it.

(l) emptytablelock (4712): If the file is open, then obtain a table lock on the file.

(m) pushinsert (4713), pushinsertnooverwrite (4714), pushdeleteboth (4715), pushdeleteany (4716): If the transaction exists and the file is open, then perform the identified insertion or deletion as follows. Establish commit and abort actions for the operation. If the LSN of the dictionary is older than the LSN of this log entry, then push the operation's message into the dictionary.

(n) pushinsertmultiple (4717), pushdeletemultiple (4718): If the transaction exists, then generate each required row and, for each generated row, perform the actions that would have been done if that row were found in a pushinsert (4713) or pushdeleteany (4716) message.

(o) comment (4719): Continue.

(p) load (4720): If the transaction exists, then establish a commit action to delete the old file, and an abort action to delete the new file.

(q) txndict (4721):

9. Clean up the recovery environment by closing the dictionaries.

10. Release the file lock.

When the end of the log has been reached, the system performs a checkpoint, and has recovered from a crash.

Once every 1000 log entries, the system prints a status message to the error log indicating progress scanning backward or forward.

The list of segments to deallocate is maintained until the data file is synchronized to disk with an fsync, after which the system deallocates unneeded segments and the disk space is used again.

A segment is kept if any of the following hold:

1. the new segment has not been written to disk,

2. the BTT has not been updated on disk, or

3. the segment is needed to represent some active version of the dictionary.

There may be other reasons to keep segments. For example, during backup, old segments are kept in an allocated state until the backup completes.

The system trims unneeded log files by deleting the files that are no longer needed. A log file is needed if

1. the log file contains the checkpoint_begin (4701) corresponding to the most recently logged checkpoint_end (4702),

2. some uncompleted transaction has a log entry in the log file, or

3. an older log file is needed.

There may be other reasons that a log file is needed. For example, during backup, all the log entries that existed at the beginning of the backup are kept until the backup completes. After a log file is deleted, the system can reuse the storage space for other purposes, including but not limited to writing more log files or writing dictionary data files.
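A sketch of this trimming rule follows. Note that rule 3 implies that every file newer than a needed file is also kept, so the kept files always form a suffix of the log; the per-file transaction index used here is an assumed representation.

```python
def files_to_keep(log_files, live_txns, ckpt_begin_file):
    """log_files: dict mapping file index -> set of XIDs with entries in
    that file; ckpt_begin_file: index of the file holding the
    checkpoint_begin of the most recent complete checkpoint."""
    needed = {ckpt_begin_file}                      # rule 1
    needed |= {i for i, xids in log_files.items()   # rule 2
               if xids & live_txns}
    oldest = min(needed)
    return {i for i in log_files if i >= oldest}    # rule 3: keep the suffix

assert files_to_keep({1: {10}, 2: set(), 3: {42}},
                     live_txns={42}, ckpt_begin_file=2) == {2, 3}
```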

In one mode of operation, the system, for each dictionary modified by a transaction, allocates a segment in the dictionary. Log entries that mention a file number are logged in the segment of the dictionary corresponding to the file number instead of in the log. An additional txndict (4721) log entry is recorded after the checkpoint_begin (4701) and before the checkpoint_end (4702) to note the existence of this segment. The txndict (4721) entry records the XID of the relevant transaction in transaction_id (2812), the filenum (4726), which denotes which file contains the segment, and the blocknum (404), which denotes which block contains the segment, the block number being translated using the BTT to identify where in the file the segment is stored. In this mode, all information needed for recovery can be found in log entries subsequent to the checkpoint_begin (4701) corresponding to the most recent checkpoint_end (4702).

Lock Tree

The system employs a data structure called a lock tree to provide isolation between different transactions. The lock tree implements row-level locks on single rows and ranges of rows in each dictionary. A lock is said to cover a row if the lock is a lock on that row or on a range that includes that row. In some situations, the system employs exclusive locks, and in some situations the system employs reader-writer locks. In the system, only one transaction can hold a writer lock that covers a particular row, and if there is such a transaction, then no reader locks may be held that cover that row. Multiple reader locks may be held by different transactions on the same row at the same time.

Transactions read and write key-data pairs. For the purpose of locking, we refer here to those key-data pairs as points. For a DUP database, a point can be identified by a key-value pair. For a NODUP database, the key alone is enough to identify a point. In either case, a point corresponds to a single pair in the dictionary. The locking system defines two special points, called ∞ and −∞. These two special points are values that are not seen by the user of the locking system. Points can be compared by a user-defined comparison function, which is the same function used to compare pairs in the dictionary.

A transaction t holds a lock on zero, one, or more points. For example, when providing serializable isolation semantics, if a transaction performs a query, and the transaction doesn't change any rows, then the transaction can perform the same query again and get the same answer. In one mode of operation, the transaction acquires reader locks on at least all the rows it reads so that another transaction cannot change any of those rows.

For example, in some isolation modes, if a transaction performs a query to “retrieve the smallest element of a dictionary” and obtains P, the system acquires a reader lock on the range [−∞, P], even though the query only actually read P. This prevents a separate transaction from inserting a point P₂ < P before the first transaction finishes, violating the isolation property, because if the first transaction were to ask again for the smallest element, it would get P₂ instead of P.

As this example indicates, a transaction acquires locks on ranges of points. In this document, when we say “range,” we mean a closed interval. A range of points is a set identified by its endpoints x and y, where x ≤ y. When x = y, the set has cardinality one. Otherwise, the set may contain more than one point, and the endpoints may be finite or infinite. The system treats both −∞ and ∞ as possible endpoints of ranges.

For each transaction and each database, the lock tree maintains a set of closed ranges that have been read (the read set) and a set of points, which are 1-point ranges, that have been written (the write set). Ranges in the read set represent both points that have been read and points that needed to be locked to ensure proper isolation.

In some situations, the system escalates locks, so the write set can sometimes contain ranges that are not single points. If a transaction holds locks on two ranges [a, b] and [c, d], where a < b < c < d, and no other transaction holds conflicting locks in the range [a, d], the system may replace the two ranges with the larger range [a, d]. The system may escalate locks in this way in order to save memory, or for other reasons, including but not limited to speeding up operations on the locks.

The lock tree can determine whether the read set of one transaction intersects the write set of another transaction, and whether the write sets of two transactions intersect. If there are any such intersections, then the lock tree is conflicting. The lock tree operates as follows:

1. The system attempts to add a set of points to a read or write set. The added set can be either a single point added to the write set or a closed range added to the read set.

2. If the resulting lock tree would be conflicting, the set is not added. Instead an error is returned. If the resulting lock tree is not in conflict, then the lock tree is updated and the addition is successful.

When a transaction completes, it releases all the locks it holds.

A lock tree comprises a set of range trees. There may be zero, one, or more range trees.

A range tree maintains a set of ranges and, for each range, an associated data value. Specifically, a range tree S maintains a finite set of distinct pairs of the form ⟨I, T⟩, where I = [L, H] is a closed range of points which are locked, and T is the associated data item. In this system, T is the XID of the transaction that has locked the range.

The system categorizes range trees into four groups: range trees are considered either overlapping or non-overlapping; independently, range trees are considered homogeneous or heterogeneous.

In a non-overlapping range tree, the ranges do not overlap.

Ranges in an overlapping range tree sometimes overlap.

Ranges in a homogeneous range tree have the same associated data item. The system uses homogeneous range trees to store ranges that are all locked by the same transaction.

Ranges in a heterogeneous range tree may store the same or different associated data items for different ranges. The system uses heterogeneous range trees to store ranges that can be locked by multiple transactions.

The system can perform the following operations on range trees:

1. FINDALLOVERLAPS (S, I) returns all pairs in S that overlap a given range I.

2. FINDOVERLAPS (S, I, k) returns all K pairs from range tree S whose ranges overlap range I, unless K > k, in which case the function returns only k of these pairs, arbitrarily chosen.

3. INSERT (S, I, T) inserts a new pair ⟨I, T⟩ into range tree S, modifying S. A non-overlapping range tree does not allow an insert that causes an overlap, returning an error instead. Similarly, a homogeneous range tree does not allow an insert with an associated data item that is different from any others already in S.

4. DELETE (S, I, T) removes range I with associated data item T from S, if such a pair exists.

Non-overlapping ranges can be ordered, which therefore induces a total order on pairs in a non-overlapping range tree. The system defines [a, b] < [c, d] if and only if b < c. This ordering function also defines a partial order on arbitrary ranges, even those that overlap.

There is a partial order on points and ranges. The system defines a < [b, c] if and only if a < b, and [b, c] < a if and only if c < a.

The system performs the following additional operations on non-overlapping range trees:

1. PREDECESSOR (S, P) returns the greatest ⟨I, T⟩ in range tree S where range I < P, or else returns “not found” if no such pair exists.

2. SUCCESSOR (S, P) returns the least ⟨I, T⟩ in range tree S where range P < I, or “not found” if no such pair exists.

The non-overlapping range tree can be implemented using a search data structure, which includes but is not limited to an OMT, a red-black tree, an AVL tree, or a PMA. Non-overlapping range trees can also be implemented using other data structures, including but not limited to sorted arrays or non-balanced search trees.

In the search tree, the system stores the endpoints of all ranges, and an indication on each endpoint of whether it is a right or a left endpoint.

The overlapping range tree can also be implemented using a search tree, where some additional information is stored in the internal nodes of the tree. The system stores the intervals in a binary search tree, ordered by left endpoint. In every node in the tree, the system stores the value of the maximum right endpoint stored in the subtree rooted at that node.
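A minimal sketch of this augmented search tree follows, left unbalanced for brevity (the system would use a balanced tree). The find_all_overlaps procedure uses the stored subtree maximum to prune subtrees that cannot reach the query range.

```python
class Node:
    def __init__(self, lo, hi, txn):
        self.lo, self.hi, self.txn = lo, hi, txn
        self.max_hi = hi                 # maximum right endpoint in subtree
        self.left = self.right = None

def insert(root, lo, hi, txn):
    if root is None:
        return Node(lo, hi, txn)
    if lo < root.lo:                     # ordered by left endpoint
        root.left = insert(root.left, lo, hi, txn)
    else:
        root.right = insert(root.right, lo, hi, txn)
    root.max_hi = max(root.max_hi, hi)   # maintain the subtree maximum
    return root

def find_all_overlaps(root, lo, hi, out):
    """Collect every stored range that overlaps the closed range [lo, hi]."""
    if root is None or root.max_hi < lo:
        return out                       # nothing in this subtree reaches lo
    find_all_overlaps(root.left, lo, hi, out)
    if root.lo <= hi and lo <= root.hi:  # closed ranges overlap
        out.append((root.lo, root.hi, root.txn))
    if root.lo <= hi:                    # right subtree may still overlap
        find_all_overlaps(root.right, lo, hi, out)
    return out

t = None
for iv in [(-6, 5, "T1"), (4, 6, "T2"), (7, 10, "T3")]:
    t = insert(t, *iv)
assert [x[2] for x in find_all_overlaps(t, 0, 5, [])] == ["T1", "T2"]
```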

For the purpose of the lock tree, each database is handled independently, so we can describe the representation as though there is only one database.

The system employs a collection of zero or more range trees to represent a lock tree. The ranges represent regions of key space or key-value space that are locked by a transaction.

The lock tree comprises:

1. for each pending transaction t,

(a) a LOCALREADSET range tree R_(t), and

(b) a LOCALWRITESET range tree W_(t);

2. a GLOBALREADSET range tree GR; and

3. a BORDERWRITE range tree B.

Each R_(t) comprises a homogeneous non-overlapping range tree. The system employs R_(t) to maintain the read set for transaction t. The presence of a pair ⟨[x, y], t⟩ ∈ R_(t) means that transaction t holds a read lock on range [x, y].

Each W_(t) comprises a homogeneous non-overlapping range tree. The system employs W_(t) to maintain the write set for transaction t. The presence of a pair ⟨[x, y], t⟩ ∈ W_(t) means that transaction t holds a write lock on range [x, y].

GR comprises a heterogeneous overlapping range tree that maintains the union of all read sets. The system employs range tree GR to contain information that can, in principle, be calculated from the LOCALREADSET trees. A pair ⟨[x, y], t⟩ ∈ GR means that transaction t holds a read lock on range [x, y].

B comprises a heterogeneous non-overlapping range tree. The system employs B to hold maximal ranges of the form ⟨[x, y], t⟩. The system stores ⟨[x, y], t⟩ in B when the following conditions hold:

1. Transaction t holds locks on points x and y. All points in the range [x, y] are either locked by transaction t or are unlocked.

2. The largest locked point less than x (if one exists) and the smallest locked point greater than y (if one exists) are locked by transactions other than t.

In principle, all the information in the BORDERWRITE tree can be calculated from the LOCALWRITESET trees.

The system performs range consolidation on some insertions, meaning that when a transaction T locks two overlapping ranges X and Y, the system replaces those two ranges with a single combined range X ∪ Y. If ranges are consolidated, then all distinct ranges stored in a range tree for the same transaction are nonoverlapping.

Range consolidation is implemented in a homogeneous range tree as follows. Before ⟨I, T⟩ is inserted into a homogeneous range tree S, the system uses a FINDOVERLAPS (S, I, k=∞) query, which returns all ranges that overlap the new range I. The system deletes those ranges from the range tree, then creates a new range that is the union of all these ranges, including I, and inserts this new range into the lock tree.

In a heterogeneous range tree, range consolidation is similar, except that the system checks that only ranges corresponding to the same T are consolidated. One way to maintain range consolidation on a heterogeneous range tree is to maintain separate (homogeneous) range trees for each associated T. The system uses GR in this fashion. The system identifies which intervals to consolidate in the heterogeneous range tree GR by first doing range consolidation on the homogeneous range tree R_(T).

As an example, consider range tree S = {⟨[0, 1], t⟩, ⟨[2, 4], t⟩}. If ⟨[1, 3], t⟩ is added, then, after range consolidation, the range tree stores S = {⟨[0, 4], t⟩}.
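A sketch of consolidation for one transaction's homogeneous tree follows, modeled here as a plain list of closed ranges rather than a tree; the FINDOVERLAPS/DELETE/INSERT sequence described above is folded into one pass.

```python
def consolidate_insert(ranges, new):
    """Insert closed range `new`, merging it with every overlapping
    stored range, as FINDOVERLAPS, DELETE, and INSERT would."""
    lo, hi = new
    kept = []
    for (a, b) in ranges:
        if a <= hi and lo <= b:          # overlap: fold into the union
            lo, hi = min(lo, a), max(hi, b)
        else:
            kept.append((a, b))
    kept.append((lo, hi))
    return sorted(kept)

# The example from the text: S = {[0,1], [2,4]}; adding [1,3] yields {[0,4]}.
assert consolidate_insert([(0, 1), (2, 4)], (1, 3)) == [(0, 4)]
```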

We say that an interval I (or a point P) meets a range tree if one of the intervals stored in the range tree overlaps I (or P).

We say that an interval I (or point P) meets a range tree at T if I (or P) overlaps an interval in the range tree associated with T.

We say that an interval I (or point P) is dominated by a range tree if the interval I is entirely contained in one of the intervals stored in the range tree.

As an example, consider [0, 5] and range tree {⟨[−6, 5], T₁⟩, ⟨[4, 6], T₂⟩, ⟨[7, 10], T₃⟩}. Interval [0, 5] meets the range tree. Specifically, [0, 5] meets the range tree at T₁ and meets the range tree at T₂, but does not meet the range tree at T₃. Interval [0, 5] is also dominated by this range tree, because [0, 5] is entirely contained in [−6, 5].

The system employs the lock tree to answer queries about whether an interval I meets or is dominated by a range tree, and at what transaction. The system implements those queries using the procedure FINDOVERLAPS, either with k=1 or k=2. Examples include, but are not limited to, the following queries:

1. Does an interval I meet a range tree S? The system uses a FINDOVERLAPS query with k=1.

2. Given a point P, a transaction T, and a range tree S, does the point P meet the range tree S at a transaction different from T? The system uses a FINDOVERLAPS query with k=2.

3. Given an interval I, a transaction T, and a range tree S, does more than one interval in S overlap I? If so, return “more than one overlap.” Otherwise, if exactly one interval overlaps, and its associated transaction is different from T, then return the name T′ of that transaction. Otherwise return “ok”. The system performs this three-way test using a FINDOVERLAPS query with k=2.

4. Given an interval I and a range tree S, does S dominate I? The system uses a FINDOVERLAPS query with k=2, taking advantage of range consolidation.
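As a hedged sketch of how queries 1 and 4 can be answered with bounded k, the following builds on the RangeSet stand-in above. Because consolidated ranges are disjoint, an interval can be contained in at most one stored range, so fetching k=2 overlaps suffices to decide domination.

    def meets(tree, lo, hi):
        # Query 1: does [lo, hi] overlap any stored interval?  k = 1 suffices.
        return len(tree.find_overlaps(lo, hi, k=1)) == 1

    def dominated(tree, lo, hi):
        # Query 4: with consolidation, at most one stored interval can contain
        # [lo, hi], so fetching k = 2 overlaps is enough to decide.
        overlaps = tree.find_overlaps(lo, hi, k=2)
        return len(overlaps) == 1 and overlaps[0][0] <= lo and hi <= overlaps[0][1]

    t = RangeSet()
    t.insert_consolidated(-6, 5)
    print(meets(t, 0, 5), dominated(t, 0, 5))    # True True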

In more detail, the lock tree operates as follows.

1. For transaction T to acquire a read lock on a closed range I:

(a) If I is dominated by W_(T) then return success.

(b) Else if I is dominated by R_(T) then return success.

(c) Else if I meets the BORDERWRITE tree B at a transaction T₂≠T and I meets the write set W_(T₂) then return failure.

(d) Else insert ⟨I, T⟩ into GR and into R_(T), consolidate ranges if necessary, and return success.

2. For transaction T to acquire a write lock on a point P:

(a) If P is dominated by W_(T) then return success.

(b) Else if P meets GR at a transaction T₂≠T then return failure.

(c) Else if P meets B at a transaction T₂≠T and P meets W_(T₂) then return failure.

(d) Else insert ⟨[P, P], T⟩ into W_(T) and consolidate ranges if necessary. Then update the BORDERWRITE tree B to include ⟨[P, P], T⟩ and return success.

3. For transaction T to release all of its locks (which happens when the transaction commits or aborts):

(a) Release the read set.

i. For each range I ∈ R_(T): delete ⟨I, T⟩ from the GLOBALREADSET tree GR.

ii. Delete the entire LOCALREADSET tree R_(T) for transaction T.

(b) Release the write set.

i. For each range I ∈ LOCALWRITESET W_(T): if I meets the BORDERWRITE tree B at T, then update the BORDERWRITE tree B to exclude ⟨I, T⟩.

ii. Delete the entire LOCALWRITESET tree W_(T) for transaction T.
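The following simplified Python sketch renders steps 1 and 2, reusing the RangeSet, meets, and dominated sketches above. For brevity it checks the per-transaction write sets W directly instead of consulting the BORDERWRITE tree B, and it keeps the read sets only per transaction rather than also in a global GR tree; those are simplifications, not the system's actual procedure.

    def meets_at(trees, lo, hi):
        """Transactions whose range tree overlaps [lo, hi]; trees maps txn -> RangeSet."""
        return [t for (t, rs) in trees.items() if meets(rs, lo, hi)]

    def acquire_read_lock(T, lo, hi, W, R):
        if dominated(W.setdefault(T, RangeSet()), lo, hi): return True   # step 1(a)
        if dominated(R.setdefault(T, RangeSet()), lo, hi): return True   # step 1(b)
        if any(t2 != T for t2 in meets_at(W, lo, hi)):     return False  # step 1(c)
        R[T].insert_consolidated(lo, hi)                                 # step 1(d)
        return True

    def acquire_write_lock(T, P, W, R):
        if dominated(W.setdefault(T, RangeSet()), P, P):   return True   # step 2(a)
        if any(t2 != T for t2 in meets_at(R, P, P)):       return False  # step 2(b)
        if any(t2 != T for t2 in meets_at(W, P, P)):       return False  # step 2(c)
        W[T].insert_consolidated(P, P)       # step 2(d); BORDERWRITE update elided
        return True

    W, R = {}, {}
    print(acquire_write_lock("T1", 3, W, R))     # True
    print(acquire_read_lock("T2", 0, 5, W, R))   # False: [0, 5] meets T1's write lock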

To update the BORDERWRITE tree B to include ⟨I=[P, P], T⟩:

1. Run a FINDOVERLAPS (B, I, k=1) query to retrieve a set F. Either F={⟨I_(F), T_(F)⟩} or F=∅.

2. If |F|=1 and T_(F)=T then return success.

3. Else if |F|=1 and T_(F)≠T then:

(a) Remove the overlapping range from B: DELETE (B, I_(F), T_(F)).

(b) Split I_(F) into two ranges for transaction T_(F):

i. Run STRICTSUCCESSOR (W_(T_F), P) to retrieve ⟨I_(S), T_(F)⟩.

ii. Run STRICTPREDECESSOR (W_(T_F), P) to retrieve ⟨I_(P), T_(F)⟩.

iii. Insert the lower end of the split range into B as INSERT (B, [I_(F,L), I_(P,H)], T_(F)).

iv. Insert the upper end of the split range into B as INSERT (B, [I_(S,L), I_(F,H)], T_(F)).

(c) Insert the new range into the BORDERWRITE tree as INSERT (B, I, T).

(d) Return success.

4. Else (|F|=0):

(a) Extend I if necessary:

i. Run STRICTSUCCESSOR (B, P) to retrieve ⟨I_(S), T₂⟩.

ii. If a successor is found and T=T₂ then extend I to include I_(S) in the BORDERWRITE tree:

A. Remove the successor range from B as DELETE (B, I_(S), T).

B. Insert an extended range covering both I and I_(S) as INSERT (B, [P, I_(S,H)], T).

C. Return success.

iii. Run STRICTPREDECESSOR (B, P) to retrieve ⟨I_(P), T₃⟩.

iv. If a predecessor is found and T=T₃ then extend I to include I_(P) in the BORDERWRITE tree:

A. Remove the predecessor range from B as DELETE (B, I_(P), T).

B. Insert an extended range covering both I and I_(P) as INSERT (B, [I_(P,L), P], T).

C. Return success.

(b) Insert I into B as INSERT (B, I, T).

(c) Return success.

Here I_(X,L) and I_(X,H) denote the low and high endpoints of an interval I_(X).

To update the BORDERWRITE tree B to exclude ⟨[P, P], T⟩:

1. Let I=[P, P].

2. Run a FINDOVERLAPS (B, I, k=1) query to retrieve a set F, where either F={⟨I_(F), T_(F)⟩} or F=∅.

3. If |F|=0 then return success.

4. Else if |F|=1 and T_(F)≠T then return success.

5. Else (|F|=1 and T_(F)=T):

(a) Remove the overlapping range from B as DELETE (B, I, T).

(b) Run STRICTSUCCESSOR (B, P) to retrieve ⟨I_(S), T₂⟩.

(c) Run STRICTPREDECESSOR (B, P) to retrieve ⟨I_(P), T₃⟩.

(d) If both a predecessor and a successor are found and T₂=T₃, then merge I_(S), I_(P), and the set of points between them:

i. Remove the successor range from B as DELETE (B, I_(S), T₂).

ii. Remove the predecessor range from B as DELETE (B, I_(P), T₂).

iii. Insert the extended range as INSERT (B, [I_(P,L), I_(S,H)], T₂).

(e) Return success.
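The merge step 5(d) is the subtle part of the exclude procedure: deleting a point range belonging to T may leave its two neighbors, both belonging to some other single transaction, free to coalesce. A small illustrative Python sketch, using a sorted list of (low, high, transaction) triples in place of B:

    import bisect

    def borderwrite_exclude(B, P, T):
        """B is a sorted list of disjoint (lo, hi, txn) triples standing in for
        the BORDERWRITE tree; removes [P, P] for T and merges the neighbors."""
        i = next((j for j, (a, b, t) in enumerate(B) if a <= P <= b), None)
        if i is None or B[i][2] != T:
            return                                   # steps 3 and 4: nothing to do
        del B[i]                                     # step 5(a)
        # After the deletion, B[i-1] is the strict predecessor and B[i] the
        # strict successor of P.  If they belong to the same transaction,
        # replace them with one extended range (step 5(d)).
        if 0 < i < len(B) and B[i - 1][2] == B[i][2]:
            lo, _, t2 = B[i - 1]
            _, hi, _ = B[i]
            del B[i - 1:i + 1]
            bisect.insort(B, (lo, hi, t2))

    B = [(0, 2, "T1"), (3, 3, "T2"), (4, 6, "T1")]
    borderwrite_exclude(B, 3, "T2")
    print(B)                                         # [(0, 6, 'T1')]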

The system escalates locks when running short on memory to hold the lock table. To escalate locks, the system finds one or more adjacent ranges from the same transaction and merges them. If no such ranges can be found, then the system allocates more memory to the lock table, and may remove memory allocated to other data structures, including but not limited to the buffer pool.

To implement serializable transactions:

1. When inserting a pair, the system obtains a write lock on the pair. If the lock is obtained, the pair is inserted.

2. When looking up a pair, the system obtains a read lock on the pair. If the lock is obtained, the pair is looked up in the dictionary.

3. When querying to find the smallest pair q greater than or equal to a particular pair p, the system performs the following:

(a) Search to find q.

(b) If no such value exists, then find the largest value r in the dictionary and lock the range [r, ∞]. If the lock cannot be obtained, then the query fails. Otherwise the query succeeds, returning an indication that there is no such value.

(c) If p=q then the system acquires a read lock on [q, q]. If the lock is obtained, the query succeeds; otherwise it fails.

(d) Else search to find the successor s of q. If such a successor exists, then lock [q, s]; otherwise lock [q, ∞]. If the lock cannot be acquired, then the query fails; otherwise the query succeeds.

4. When querying to find the successor q of p, where p has already been returned by a previous search (for example, in a cursor NEXT operation), the system performs the following:

(a) Search to find q. If no successor exists, then let q=∞.

(b) Lock the range [p, q].
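A hedged sketch of step 3 follows, with a sorted mapping of committed pairs standing in for the dictionary and try_read_lock standing in for the lock-tree primitive above; both names are illustrative.

    INF = float("inf")

    def smallest_geq(p, pairs, try_read_lock):
        """Serializable 'smallest pair q >= p' query (step 3 above)."""
        keys = sorted(pairs)
        ge = [k for k in keys if k >= p]
        if not ge:                                    # step 3(b): no such value
            r = keys[-1] if keys else -INF
            return ("none", None) if try_read_lock(r, INF) else ("fail", None)
        q = ge[0]
        if q == p:                                    # step 3(c): exact match
            return ("ok", q) if try_read_lock(q, q) else ("fail", None)
        s = ge[1] if len(ge) > 1 else INF             # step 3(d): lock [q, s]
        return ("ok", q) if try_read_lock(q, s) else ("fail", None)

    print(smallest_geq(2, {1: "a", 3: "b", 5: "c"}, lambda lo, hi: True))  # ('ok', 3)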

The system also performs other queries, including but not limited to finding the greatest pair less than or equal to a given value, and finding the predecessor of a value.

Alternatively, instead of failing when a lock conflict is detected, the system could perform another action. For example, the system could retry several times; or the system could retry immediately, wait some time, retry again, wait a longer time, and retry again, eventually timing out and failing. Or the system could simply wait indefinitely for the conflicting lock to be released, in which case the system may employ a deadlock detection computation to kill one or more of the transactions that are deadlocked.

The system also provides other isolation levels. For example, to implement a read-committed isolation level, the system acquires read locks on selected data but releases them immediately, whereas write locks are released at the end of the transaction. For read uncommitted, read locks are not obtained at all. In another mode, the system implements read-committed isolation by reading the committed transaction record from a leaf entry (described below), and implements read-uncommitted by reading the most deeply nested transaction record from a leaf entry, in both cases without obtaining a read lock. For repeatable-read isolation levels, instead of locking ranges, the system can lock only those points that are actually read. For snapshot isolation, the system can keep multiple versions of each pair instead of using locks, and return the proper version of the pair in response to a query.

Transaction Commit and Abort

When a transaction commits or aborts, the system performs cleanup operations to finish the transaction. If a transaction commits, the cleanup operations cause the transaction's changes to take permanent effect. If a transaction aborts, the system undoes the operations of the transaction in a process called rollback.

The system implements these transaction-finishing operations by maintaining a list of operations performed by the transaction. This list is called the rolltmp log.

For example, each time the system pushes an insert message (2801) into the dictionary, it records that fact. If the transaction aborts, then an abort_both (2808) message is inserted into the dictionary to clean up. If the transaction commits, then a commit_both (2806) message is inserted.

For each operation, the system stores enough information in the rolltmp log so that the proper cleanup operations can be performed on abort or commit.

In the case where the system crashes before a transaction commits, then during recovery transactions are created and a rolltmp log is recreated. When recovery completes, if there are any incomplete transactions, then recovery aborts those transactions, executing the proper cleanup actions from the rolltmp log.
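A minimal sketch of a rolltmp log follows; the message names commit_both and abort_both follow the text above, while the RolltmpLog class and the dictionary's push method are illustrative stand-ins, not the system's actual interfaces.

    class RolltmpLog:
        """Records each operation so the right cleanup runs on commit or abort."""
        def __init__(self):
            self.entries = []

        def record_insert(self, key, txn):
            self.entries.append(("insert", key, txn))

        def finish(self, dictionary, committed):
            # On commit, push commit_both messages; on abort, push abort_both.
            for op, key, txn in reversed(self.entries):
                if op == "insert":
                    msg = "commit_both" if committed else "abort_both"
                    dictionary.push(msg, key, txn)
            self.entries.clear()

    class PrintDict:                      # stand-in dictionary for the example
        def push(self, *message):
            print(message)

    log = RolltmpLog()
    log.record_insert("k1", "T7")
    log.finish(PrintDict(), committed=False)   # pushes ('abort_both', 'k1', 'T7')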

Error Messages, Acknowledgments, and Feedback

The system can return acknowledgments and error messages depending on the specific settings in the dictionary.

For example, the operations INSERT(k, v) or DELETE(k) in the NODUP case can return an additional Boolean indicating whether, at the time that the command was issued, there already existed a key k in the database. A return value of FALSE means that the key k did not exist in the database at the time of the insert, and a return value of TRUE means that it did. For the INSERT(k, v) operation, depending on the operating mode, an old value v′ can be overwritten by the value v, or the insertion of value v can be disallowed by the system because of the existence of the old value v′.

The operations INSERT(k, v) or DELETE(k) in the DUP case can return an additional Boolean indicating whether, at the time of the insertion, there already existed a key k in the database. For the operation INSERT(k, v), there can be an additional Boolean indicating whether the key-value pair (k, v) already existed in the database. For the operation DELETE(k), the number of key-value pairs that were deleted can be returned.

One way to determine a status Boolean is to perform an implicit search when performing INSERT(k, v). That is, before starting INSERT(k, v), perform the operation Search(k) to determine whether k already appears in the dictionary.

In another mode the system returns these status Booleans by filtering out some of the search operations using a smaller dictionary, or an approximate dictionary, that can fit within RAM, thus avoiding a full Search(k).

The system uses ten different filters that store information about which keys are in the streaming dictionary. Alternatively, the system could use a different number of filters.

Each filter is implemented using a hash table. Denote the hash function as h(x), and denote the filter's underlying bit array as H. Suppose that there are N keys. Then the filter stores Θ(N) bits, where the number of bits is always at least 2N, and H[t]=1 if and only if there exists a key k stored in the dictionary such that h(k)=t.

This filter exhibits one-sided error. That is, the filter may indicate that a key k is stored in the dictionary when, in fact, it is not. However, if the filter indicates that a key k is not in the dictionary, then the key is absent. Each filter has a constant error probability. Suppose that the error probability is ½. Then the probability that all 10 filters are in error is at most 2⁻¹⁰.

The total space consumption for all filters can be less than 32 bits per element, which will often be more than one or two orders of magnitude smaller than the total size of the dictionary.

Observe that this specification uses a variation on the filter that supports deletions. One such variation is called a counting filter.

If, for a given key, all filters say that the key may be in the dictionary, then the system searches for it to determine whether it is. If one or more filters say that it is not in the dictionary, then the system does not search for it. Even if a single filter of the ten indicates that a key k is not in the dictionary, then it is not necessary to search in the actual dictionary. Thus, the probability of searching in the dictionary when the key is not present is approximately 2⁻¹⁰.
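The following is a small Python sketch of this scheme using counting filters, which support deletions as noted above; the table size, the use of salted SHA-256 as the hash, and all names are illustrative choices, not the system's.

    import hashlib

    class CountingFilter:
        def __init__(self, nslots, salt):
            self.counts = [0] * nslots
            self.salt = salt

        def _slot(self, key):
            digest = hashlib.sha256((self.salt + key).encode()).hexdigest()
            return int(digest, 16) % len(self.counts)

        def add(self, key):    self.counts[self._slot(key)] += 1
        def remove(self, key): self.counts[self._slot(key)] -= 1
        def maybe_contains(self, key): return self.counts[self._slot(key)] > 0

    filters = [CountingFilter(1 << 16, salt=str(i)) for i in range(10)]

    def on_insert(key):
        for f in filters:
            f.add(key)

    def must_search(key):
        # Search the real dictionary only if every filter answers "maybe";
        # a single "absent" answer proves the key is not stored.
        return all(f.maybe_contains(key) for f in filters)

    on_insert("employee:1234")
    print(must_search("employee:1234"), must_search("employee:9999"))  # True False (w.h.p.)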

Thus, the cost to insert a new key not currently in the dictionary can be reduced by an arbitrary amount by adding more RAM, to well below one disk seek per insertion. The cost to insert a key already in the dictionary still involves a full search, and thus costs Ω(1) memory transfers.

In some situations, the system makes all insertion operations give feedback in o(1) memory transfers by storing cryptographic fingerprints of the keys in a hash table. The data structure uses under 100 bits per key, which is often orders of magnitude smaller than the size of the streaming B-tree.

Refer now to FIG. 15, in which additional feedback is provided to the users upon updates. In this figure, there are four keys (1501) in the tree (1502): a, baab, bb, and bbbba. The hash tables have 11 array positions and are indexed from 0 to 10. Two keys (1504) are to be inserted into the tree: aa and bba.

The first table (1505), T₁, has hash function h₁(x), which hashes the four values in the tree as follows:

h₁(a)=5, h₁(baab)=9, h₁(bb)=9, h₁(bbbba)=1,

and hashes the two new values as follows:

h₁(aa)=5, h₁(bba)=9.

The second table (1506), T₂, has hash function h₂(x), which hashes the four values in the tree as follows:

h₂(a)=8, h₂(baab)=0, h₂(bb)=6, h₂(bbbba)=3,

and hashes the two new values as follows:

h₂(aa)=7, h₂(bba)=8.

The last table (1507), T₁₀, has hash function h₁₀(x), which hashes the four values in the tree as follows:

h₁₀(a)=0, h₁₀(baab)=9, h₁₀(bb)=7, h₁₀(bbbba)=5,

and hashes the two new values as follows:

h₁₀(aa)=3, h₁₀(bba)=9.

In all tables, hash marks indicate that an element is hashed to that array position (1508). Upon insertion of a key, the data structure returns whether that key already exists in the tree or not.

In this example, the two keys (1504) to be inserted into the tree, aa and bba, do not already exist in it. Inserting aa does not require a search in the tree because

T₂[h₂(aa)]=T₂[7]=0,

as shown at (1509), meaning that aa cannot already be stored in the dictionary. In contrast, determining whether bba is in the dictionary requires a search, because T_(i)[h_(i)(bba)]=1 for every hash table i, as shown at (1510).
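The figure's decision procedure can be restated as a few lines of Python; the hash values below are hard-coded from the three tables shown (the remaining seven tables are omitted):

    h = {
        "T1":  {"a": 5, "baab": 9, "bb": 9, "bbbba": 1, "aa": 5, "bba": 9},
        "T2":  {"a": 8, "baab": 0, "bb": 6, "bbbba": 3, "aa": 7, "bba": 8},
        "T10": {"a": 0, "baab": 9, "bb": 7, "bbbba": 5, "aa": 3, "bba": 9},
    }
    tables = {name: [0] * 11 for name in h}
    for name in h:
        for key in ("a", "baab", "bb", "bbbba"):    # the four keys in the tree
            tables[name][h[name][key]] = 1

    def needs_search(key):
        return all(tables[name][h[name][key]] == 1 for name in h)

    print(needs_search("aa"))    # False: T2[7] == 0, so aa cannot be present
    print(needs_search("bba"))   # True: every table shown hits a set bit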

Alternatively, other feedback messages can be returned to the user. For example, one could give feedback to the user that is approximate or has a probability of error.

Alternatively, other parameter settings can be chosen. For example, the sizes and number of approximate dictionaries could vary.

Alternatively, other compact dictionaries and approximate dictionaries can be used. For example, one can use other filter and hash-table alternatives.

Alternatively, there are other ways to return error messages and acknowledgments to users without an immediate full search in many cases. For example, the feedback can be returned with some delay, for example, after inserted messages have reached the leaves. Another example is that after a load has completed, an explicit or implicit flush can be performed (an implicit flush can be induced, say, by a range query) to ensure that all messages have reached the leaves, and all acknowledgments or error messages have been returned to the user.

Concurrent Streaming Dictionaries

The system provides support for concurrent operations. The system allows one or more processes and/or processors to access the system's data structures at the same time. Users of the system may configure the system with many disks, processors, memory, processes, and other resources. In some cases the system can add these resources while the system is running.

When a message M(k,z) is added to the data structure, the system does not necessarily insert it into the root node u. Instead, M(k,z) is inserted into a deeper node v on M(k,z)'s root-to-leaf path, where v is paged into RAM.

This “aggressive promotion” can mitigate or avoid a concentrated hot spot at the top of the tree. When a message M(k,z) is inserted into the data structure, there is a choice of many first nodes in which to store M(k,z). Moreover, the system's data structures automatically adapt to the insertion and access patterns as the shape of the part of the tree that is stored in RAM changes.
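The following is a small illustrative sketch of aggressive promotion, assuming the usual pivot-and-children node layout; the Node class and its in_ram flag stand in for the buffer pool's paging state.

    class Node:
        def __init__(self, in_ram, pivots=(), children=()):
            self.in_ram = in_ram
            self.pivots = list(pivots)       # keys separating the children
            self.children = list(children)
            self.buffer = []                 # pending messages stored at this node

    def child_for(node, key):
        i = 0
        while i < len(node.pivots) and key >= node.pivots[i]:
            i += 1
        return node.children[i] if node.children else None

    def promote_insert(root, key, value):
        """Place the message at the deepest in-RAM node on its root-to-leaf path."""
        node = root
        while True:
            nxt = child_for(node, key)
            if nxt is None or not nxt.in_ram:
                node.buffer.append((key, value))
                return node
            node = nxt

    left, right = Node(in_ram=True), Node(in_ram=False)
    root = Node(in_ram=True, pivots=["m"], children=[left, right])
    promote_insert(root, "a", 1)    # descends into the in-RAM left child
    promote_insert(root, "q", 2)    # right child is on disk, so stays at the root
    print(len(left.buffer), len(root.buffer))   # 1 1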

Several examples help explain this adaptivity.

FIG. 16 depicts the case of updates from a uniform distribution of random keys into a dictionary (1601). Suppose that there are concurrent insertions of uniformly distributed random keys. The lines (1604) down the tree represent paths that the keys are to take down the tree, and the dots (1605) at the leaves of the tree represent the ultimate locations for the keys in the tree.

At a particular time some of the nodes in the tree that are closest to the root are paged into memory. The part of the tree that is paged into memory is indicated by hash marks (1602). In this figure the paged-in part of the tree is nearly balanced.

Messages are inserted into the leaves (1603) of the part of the tree that is kept in main memory.

Refer now to FIG. 17, which depicts a skewed case of insertions into the data structure (1701). Suppose that there are concurrent insertions at random locations in the leftmost 1% of the tree (1702), rather than in the whole tree.

The top part of the tree that is paged into memory will be skewed towards the beginning of the database. This part of the tree is indicated by hash marks (1703). Thus, this top part of the tree will be deep on leftward branches and shallow on rightward branches, so that, again, the paging system will adaptively diffuse what would otherwise be an insertion hotspot. As before, the vertical lines (1704) emanating from the root represent insert paths in the tree, and the locally deepest nodes paged into memory are represented by rectangles (1705). The messages will be inserted into these locally deepest nodes (1705).

The system obtains a write lock on a node when it inserts data into the node, and so by inserting into different nodes, the system can reduce contention for the lock on a given node.

Alternatively, there are other ways to achieve concurrency through adaptivity. For example, if a tree node is a hot spot, the system could explicitly choose to flush the buffers in the node and bring the children into RAM, if doing so reduces the contention on that node. Also, the system may choose to deal with a given node differently depending on whether it is clean or dirty.

Alternatively, there are other ways of using aggressive promotion to help achieve a highly concurrent dictionary. For example, one could use aggressive promotion for a non-tree-based streaming dictionary, such as a cache-oblivious lookahead array, to avoid insertion bottlenecks.

Alternatively, there are other ways of avoiding bottlenecks and achieving high concurrency. For example, one could use a type of data structure with a graph structure having multiple entrances into the graph, e.g., a tree with multiple roots, or roots and some descendants, or a modification of a skip graph. For example, one may replace the top Θ(log log N) levels of the tree or other data structure with a skip graph. This would reduce the contention without changing the asymptotic behavior of the dictionary.

Alternatively, additional concurrency can be achieved by having multiple disks. For example, one could use striping across all disks to make effectively bigger tree blocks. Alternatively, one could divide up the search space according to keys so that different keys are stored on different disks.

DUP and NODUP

The system can handle both NODUP and DUP dictionaries.

1. No duplicate keys allowed (NODUP). This means that no two key-value pairs that are stored in the dictionary at the same time can have keys that compare as identical.

2. Duplicate keys allowed (DUP). This means that two key-value pairs that are stored in the dictionary at the same time are allowed to have keys that compare as identical, but when the keys compare as identical, the associated values must not compare as identical.

Duplicates are stored logically in sorted order. Specifically, key-value pairs are first sorted by key. All elements with the same key are sorted in order by value.

The following are examples of functions that are supported with duplicate keys.

1. INSERT(k, v): Inserts a key-value pair (k, v). If there already exists a key-value pair (k′, v′), where k′=k and v′=v, then there are several choices depending on how flags are set: either (k′, v′) is overwritten or it is not. In either case, a call-back function may be called. Although v and v′ compare as equal, their values considered as byte strings may be different.

2. DELETE(k): Deletes a key k. In this case, all key-value pairs (k′, v) such that k′=k are deleted.

3. DELETE(k, v): Deletes a key-value pair (k, v). Any key-value pair (k′, v′) in the dictionary with k′=k and v′=v is deleted.

4. Cursor delete: The key-value pair that the cursor points to is deleted.

5. Cursor replace with v′: If the key-value pair (k, v) that is pointed to by the cursor has v′=v, then it is replaced with (k, v′).

6. Search for a particular key k: The first or last key-value pair (k′, v), where k′=k, is returned (if one exists) for one setting of flags. For another setting of flags, the search returns a cursor that points to (k′, v).

7. Search for a particular key-value pair (k, v): If a key-value pair (k′, v′) is in the dictionary, such that k′=k and v′=v, then return (k′, v′).

8. Find the predecessor or successor of a key k or a key-value pair (k, v), if it exists: The search could also find a predecessor or successor key-value pair, if it exists.

In one mode the system employs PMAs that operate in a DUP or a NODUP mode. For example, when duplicate pairs are inserted into a PMA, they are put in the appropriate place in the PMA, as defined by the ordering of pairs.

In one mode the system employs hash tables that operate in a DUP or a NODUP mode. In a NODUP mode, the hash table stores messages. In a DUP mode, the system employs an extra level of indirection in the hash tables, storing doubly-linked lists of messages. Messages are hashed by key k, and all messages associated with the same key k are stored in the same doubly-linked list. The hash function used maps keys k and k′ to the same bucket if k=k′.

In DUP mode the system allocates a hash table with a number of buckets proportional to the number of distinct key equivalence classes.

In another mode, the system uses a hash table in DUP mode in which the system hashes both the key and the value.

The system stores key-value pairs in search trees. In a search tree, the system employs pivot keys that comprise keys in a NODUP mode and that comprise key-value pairs in a DUP mode.

In DUP mode, the subtrees to the left of a pivot key contain pairs that are less than or equal to the pivot key. The subtrees to the right of the pivot key contain pairs that are greater than or equal to the pivot key. The nodes of the tree further comprise two additional Booleans, called equality bits. The equality bits indicate whether there exist any equal keys to the left and to the right of the pivot, respectively.

To search, the system uses both the pivots and the equality bits to determine which branch to follow to find the minimum or maximum key-value pair for a given key.
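A hedged sketch of such a descent follows, finding the minimum pair with a given key; the node layout, with one (pivot key, equality-left, equality-right) triple per pivot, is a simplification of the structure described above.

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class DupNode:
        pivots: List[Tuple[str, bool, bool]] = field(default_factory=list)
        children: List["DupNode"] = field(default_factory=list)
        pairs: List[Tuple[str, str]] = field(default_factory=list)   # leaf payload

    def descend_min(node: DupNode, k: str) -> Optional[Tuple[str, str]]:
        """Find the minimum key-value pair whose key equals k."""
        if not node.children:                        # leaf: first matching pair
            return next((kv for kv in node.pairs if kv[0] == k), None)
        for i, (pk, eq_left, _) in enumerate(node.pivots):
            if k < pk or (k == pk and eq_left):      # equal keys may exist left
                return descend_min(node.children[i], k)
        return descend_min(node.children[-1], k)

    left = DupNode(pairs=[("a", "1"), ("k", "x")])
    right = DupNode(pairs=[("k", "y"), ("z", "9")])
    root = DupNode(pivots=[("k", True, True)], children=[left, right])
    print(descend_min(root, "k"))                    # ('k', 'x')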

When a delete message is flushed from one buffer, the message is sent to all children that may have a matching key. All the duplicates are removed. For a cursor delete, the system deletes the item that is indicated by the cursor.

To insert, the system can use both the key and the value to determine the correct place to insert key-value pairs.

In one mode the system handles duplicates with identical values, called DUPDUP pairs. In DUPDUP mode, when a key-value pair is inserted, where that key-value pair is a DUPDUP of another key-value pair in the dictionary, there are one or more cases for what can happen, depending on how flags are set. For example:

1. Overwrite: One DUPDUP pair overwrites a previous one.

2. No overwrite: One DUPDUP pair does not overwrite a previous one; instead the previous one is kept, and the new one is discarded.

3. Keep: Both pairs are kept.

Alternatively, there are other ways of storing DUP and DUPDUP pairs. For example, duplicates could be stored in sorted order according to the time that they were inserted, or they could be stored in an arbitrary order. For example, if the sizes of two rows with the same key are different, then a larger or smaller row might be pushed in preference to the other.

Alternatively, these other orders can be maintained with minor modifications to the system described here. For example, to store pairs in sorted order based on insertion time, add a time stamp, in addition to the key and the value, and sort first by key, then by time stamp, and then by value, thereby organizing duplicate duplicates for storage. Other types of unique identifiers, time stamps, and very minor modifications to the search function can also be used in other ways of storing duplicates.

Multiple Disks

The system can use one or many disks to store data. In one mode the system partitions the key space among many disks. Which disk stores a particular key-value pair depends on which disk (or disks) is responsible for that part of the key space.

This scaling is achieved partially through a partition layer in the system. The partition layer determines which key-value pairs get stored on which disks.

The partition layer uses compact partitioning, or partitioning for short. In compact partitioning, the key space is divided lexicographically. For example, if there are 26 processor-disk systems and the keys being stored are uniformly distributed starting with letters ‘A’-‘Z’, then the first processor-disk could contain all the keys starting with ‘A’, the second could contain the keys starting with ‘B’, and so forth. In this example, the keys are uniformly distributed. We describe here compact partitioning schemes that are designed to work efficiently even when the keys are not distributed uniformly.

FIG. 53 illustrates compact partitioning for a system with two disks (5301). A PMA (5303) is distributed across the disks (5301) as shown at (5302). In this case, the values a through e are stored at offsets 0-4 respectively (all in the first disk), and the values g and h are stored at offsets 10 and 12 respectively (at the end of the second disk). If the system inserts f at offset 5 (as shown at (5304)), then the first disk is determined to be above its density threshold, and the system redistributes data (as shown at (5305)).

In one mode the system employs PMA-based compact partitioning. In this mode the key space is partitioned lexicographically, assigning each partition to one disk cluster. Recall that a PMA is an array of size Θ(N), which dynamically maintains N elements (key-value pairs) in sorted order. The elements are kept approximately evenly spaced with gaps.

The system establishes a total order on the disks compatible with the dictionary, meaning that if disk A is before disk B in the total order, then all elements (key-value pairs) stored on disk A are lexicographically before all elements stored on disk B. These disks in order form a virtual array of storage whose length is the capacity of a disk system or subsystem. We treat this virtual array as a PMA storing all elements. When an element moves from the part of the array associated with one disk to the part of the array associated with another disk, that element is migrated between disks.
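A toy Python sketch of the virtual array follows: slots map to disks by simple division, and an element migrates exactly when a rebalance moves its slot across a disk boundary. The slot counts and layouts are illustrative, loosely following FIG. 53.

    SLOTS_PER_DISK = 8

    def disk_of(slot):
        return slot // SLOTS_PER_DISK

    def migrations(old_layout, new_layout):
        """Elements whose PMA slot moved to a different disk must move between disks."""
        moves = []
        for key, old_slot in old_layout.items():
            new_slot = new_layout[key]
            if disk_of(old_slot) != disk_of(new_slot):
                moves.append((key, disk_of(old_slot), disk_of(new_slot)))
        return moves

    # Before: a..e packed at slots 0-4 on disk 0; g, h near the end of disk 1.
    old = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4, "g": 10, "h": 12}
    # After inserting f, a rebalance spreads the elements out with gaps.
    new = {"a": 0, "b": 2, "c": 4, "d": 6, "e": 8, "f": 9, "g": 11, "h": 13}
    print(migrations(old, new))    # [('e', 0, 1)]: only e crosses a disk boundary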

The system chooses the rebalance interval so that it only overlaps the boundary between one disk and the next if that disk is nearly full. Alternatively, the rebalance interval can be chosen so that it crosses the boundary between one disk and the next when one disk has a substantially higher density than a neighbor.

The system's linear ordering of the disks takes into account the disk-to-disk transfer costs. For example, it is often cheaper to move data from one disk to another disk on the same machine than to move it to disks residing elsewhere on a network. Consider a transfer-cost graph G, in which the nodes are disks, and the weight on an edge is some measure of the cost of transferring data. This weight can take into account the bandwidth between two disks, or a weighted bandwidth that is reduced if many disks need to share the same bus or other interconnect link. Alternatively, the system could also take into account the latency of transfer between disks. For example, the weighting function can decrease with increasing connectivity.

Alternatively, one disk could simulate several smaller disks in the PMA of disks. For example, if large disks are partitioned into smaller virtual disks, and the disks are then ordered for the PMA layout, one might choose for different virtual disks from the same disk not to be adjacent in the PMA order. Thus, the PMA could be made to wrap around the disks several times, say, for the purposes of load balancing. Such wrapping could, for example, allow the system to have some subset of disks serve as a RAID array, with data striping across the RAID.

Alternatively, the system could accommodate disks of different sizes.

Alternatively, there are many choices for choosing a linear order on the disks. For example, a traveling salesman problem (TSP) solution for G (or an approximate TSP solution) can be used to minimize the total cost of edges traversed in a linearization. Or a tour on a minimum (or other) spanning tree of G can be used. Or the system could choose an ordering that is approximately optimal, for example an ordering that can be proved to be within a factor of two of optimal.

In one mode, the system employs “disk recycling”. In this mode, the system does not keep a total order on all disks. Instead, a total order is kept on a subset of the disks, and the other disks are kept in reserve. If a region of key space stored on a disk becomes denser than a particular threshold, a reserved disk is deployed to split the keys with the overloaded disk. If another region of key space stored on a disk becomes sparser than a particular threshold, elements are migrated off the underused disk, and the newly empty disk can be added to the reserve.

FIG. 54 illustrates disk recycling. Disks are labeled A, B, C, and D. The first column (5401) shows the state of the disks before inserting, and the second column (5402) shows the state of the disks after inserting and rebalancing. Initially disk B (5404) holds keys a through j, disk A (5403) holds keys n through z, and disks C (5405) and D (5406) are free. This illustrates that the ordering of the disks imposed by the operating system or file system (A, B, C, D) may be different than the order imposed by the PMA (B, A, with C and D not ordered). If k is inserted into disk B (5408) and it becomes overfull, then disk D (5410) can be used, so that the PMA-induced ordering is B, D, A, with disk B (5408) holding a through f, disk D (5410) holding g through k, disk A (5407) still holding n through z, and disk C (5409) free, as shown at (5402).

In one mode the system employs an adaptive PMA (APMA). In an APMA, the system keeps a sketch of recent insertion patterns in order to learn the insertion distribution. The sketch allows the system to leave extra space for further insertions in likely hot spots.

In one mode the system replaces the PMA over the entire array with an APMA. In the case of disk recycling, the system uses an APMA over all the disks, rather than the elements, to predict where to deploy spare disks. Since an APMA rebalances intervals unevenly, leaving some intervals relatively sparse, the recycled disks can take the role of sparse intervals.

FIG. 55 illustrates disk recycling in an APMA mode. This figure shows the same scenario as FIG. 54, except that in this example the system expects that there will be many insertions between k and n, so it moves little data from disk B to disk D, so that disk D can accept more insertions without migrating data. Initially the disks (5501) are in the same state as for the initial state (5401) in FIG. 54. That is, disks A (5503), B (5504), C (5505), and D (5506) are in the same state as disks A (5403), B (5404), C (5405), and D (5406) respectively.

After rebalancing (5502), disk D (5510) contains keys j-k instead of g-k, disk B (5508) contains keys a-i, and disk A (5507) contains keys n-z. Disk C (5509) is free.

Alternatively, the disk-to-disk rebalancing system could move elements in the background, during idle time, during queries, or at other times, for example to improve hot-spot dissipation.

Alternatively, the system could group together several smaller disks to simulate a larger disk. For example, these disk groups can divide up their allotted key space by consistent hashing (hashing for short), where keys are hashed to disks at random, or nearly at random, and a streaming dictionary could be maintained on each disk.

When keys are hashed this way, hot spots are diffused across all disks participating in the hashing scheme. If the system cannot predict where a successor or predecessor lies, then the system can replicate queries across all the disks when performing successor or predecessor queries.

In a hybrid scheme, if each group has k disks, the system can employ the bandwidth of all k disks to diffuse a hot spot, and the system can limit the replication of queries to these k disks. When the dynamic partitioning scheme changes a partition boundary, thus causing items to move from one partition to another, the system can delete the items from k disks and insert them onto k other disks. The parameter k is tunable: the system can increase insertion scaling by increasing k, whereas the system can increase query scaling by decreasing k. Finally, the parameter k need not be fixed for all clusters.

An alternative approach is to reserve j disks as a buffer. Keys are first inserted into the buffer disks, which are organized by hashing. The remaining disks are organized by partitioning. As keys are inserted into the buffer, other keys are removed from the buffer and moved into the partitioned disks. If the system detects a particularly large burst of insertions into a narrow range of keys, it can recycle disks into that part of the key space to improve the performance of the partitioned disks.

In this approach, queries can be performed once on the partitioned disks, and replicated j-fold in the hashed buffer disks.

Alternatively, compact partitioning can be used for other kinds of dictionaries and data storage systems.

Buffer Flushing as Background Process

In one mode, the system performs buffer flushing as a background process. That is, during times in which the disks and processors are relatively idle, the system selects buffers and preemptively flushes them.

To implement background buffer flushing, the system maintains a priority queue, auxiliary dictionary, or other auxiliary structure storing some or all of the buffers in the tree that need to be flushed. When the CPU, memory system, and disk system have spare capacity (e.g., because they are idle), the system consults the auxiliary structure, bringing nodes into RAM and flushing the relevant buffers.

This auxiliary structure is maintained along with the tree, but it is much smaller. When the buffers in the tree change, so does the auxiliary structure. The auxiliary structure could be stored exclusively in RAM, or in some combination of RAM and disk.
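One plausible rendering of the auxiliary structure is a max-heap keyed by a pluggable priority, as in the Python sketch below; several candidate priority functions are listed after the sketch, and all names here are illustrative.

    import heapq

    flush_heap = []    # entries are (-priority, node_id): heapq is a min-heap

    def note_dirty_buffer(node_id, num_elements):
        # Here the priority is the number of buffered elements (candidate 1 below).
        heapq.heappush(flush_heap, (-num_elements, node_id))

    def idle_flush(flush_node, budget=1):
        """Called when the CPU and disks have spare capacity."""
        for _ in range(budget):
            if not flush_heap:
                return
            _, node_id = heapq.heappop(flush_heap)
            flush_node(node_id)    # bring the node into RAM and flush its buffer

    note_dirty_buffer("n17", 40)
    note_dirty_buffer("n3", 900)
    idle_flush(lambda n: print("flushing", n))   # flushes n3 first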

Alternatively, there are many ways to prioritize the buffers that need to be flushed. Examples include, but are not limited to:

1. giving higher priority to buffers that contain more elements,

2. giving higher priority to buffers that are fuller,

3. giving higher priority to nodes that contain less available space,

4. giving higher priority to buffers that were modified or read recently,

5. giving higher priority to nearly full buffers that are higher up in the tree,

6. giving higher priority to nodes whose flushes would not overflow their children, and

7. combinations of those priorities.

Alternatively, there are other ways of keeping track of which nodes need flushing. For example, the system could keep not all nodes of the main tree in the auxiliary structure, but only those buffers that are getting full and are in need of flushing. Then, when there is idle time, the system could consult this smaller structure. The buffers could be flushed in one of the orders described above or in an arbitrary order. Other strategies could also be used.

Alternatively, background buffer flushing can apply to other streaming dictionaries, including but not limited to those that are not tree-based, including but not limited to a COLA, a PMA, or an APMA. For a COLA, the system can preemptively flush regions of levels that are getting dense. A PMA or an APMA might selectively flush a level of the rebalancing hierarchy.

Overindexing

In one mode the system implements overindexing. Recall that a nonleaf node has a sequence of keys k₁, . . . , k_(a) and pointers p₀, . . . , p_(a) to children. All keys k<k₁ belong on the path going through the child pointed at by p₀. All keys k with k_(i)≦k<k_(i+1) belong on the path going through the child pointed at by p_(i).

In an overindexing mode, a node that is the parent of leaves keeps a larger sequence

k_(0,1), k_(0,2), . . . , k_(0,b), k_(1,1), . . . , k_(a,b)

of monotonically increasing keys, where k_(i,1)=k_(i) above. Similarly, the pointers are augmented to the sequence

p_(0,1), p_(0,2), . . . , p_(0,b), p_(1,1), . . . , p_(a,b),

where p_(i,1)=p_(i) above. For every i, pointers p_(i,1) to p_(i,b) point to different places in the same leaf. If some element (k, v) in child c has the smallest k such that k_(i,j)≦k<k_(i,j+1), then p_(i,j) points to the location of (k, v) in c.

The choices of keys k_(i,j) are made so as to split the elements of each leaf into parts that are sized within a factor of four of each other.

In a system with overindexing, the system fetches only approximately a 1/b fraction of the leaf that contains the element of interest.
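The following sketch shows a partial-leaf fetch under overindexing; subkeys holds the k_(i,j) for one child, offsets gives the byte position where each part of the leaf starts, and leaf_read is an assumed primitive that reads only a byte range.

    import bisect

    def overindexed_lookup(subkeys, offsets, leaf_read, key):
        """Fetch only the ~1/b fraction of the leaf whose key range holds key."""
        j = bisect.bisect_right(subkeys, key) - 1     # part whose range holds key
        end = offsets[j + 1] if j + 1 < len(offsets) else None
        return leaf_read(offsets[j], end)

    # With b = 3 parts starting at bytes 0, 4096, and 8192, a search for "k"
    # reads only bytes [4096, 8192) of the leaf.
    print(overindexed_lookup(["a", "g", "p"], [0, 4096, 8192],
                             lambda s, e: (s, e), "k"))    # (4096, 8192)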

Alternatively, the pivot keys might be chosen not to evenly split the number of elements in a leaf, but to approximately evenly split the sums of their sizes, or the probability of searching between two keys, or that probability weighted by the sizes of the elements, where the probability of accessing elements or subsets of elements can be given, or measured, or some combination thereof.

Furthermore, b need not be the same constant for each leaf.

Alternatively, nodes higher than leaf-parents can have overindexing, and in this case the overindexing pointers might point to grandchildren. In this case, the buffers in overindexed nodes might be partitioned according to the overindexing pivot keys. Then, if some such subbuffer grows large enough, the elements in the subbuffer could be flushed to a grandchild, rather than to the child.

Loader

The system includes a loader that can load a file of data into a collection of dictionaries. The system also sometimes uses the loader for other purposes, including but not limited to creating indexes and rebuilding dictionaries that have been damaged.

The loader is a structure that transforms a sequence of rows into a collection of dictionaries.

The loader is given a sequence of rows; information that the loader uses to build a set of zero or more secondary indexes; and a sort function for the primary rows and for each secondary index. The loader then generates all of the key-value pairs for the secondary indexes; sorts each index and the primary rows; forms the blocks, compressing them; and writes the resulting dictionary or dictionaries to a file. The system uses multithreading in two ways: (1) the system overlaps I/O and computation, and (2) the system uses parallelism in the compute part of the workload. The parallelizable computation includes, but is not limited to, compressing different blocks and implementing a parallel sort.

The loader can create a table comprising a primary dictionary and zero or more secondary dictionaries. A table row is a row in a SQL table, which is represented by entries in one or more dictionaries. Inserting a table row can require inserting many dictionary rows, including but not limited to the primary dictionary row and, for each index, a secondary dictionary row. Thus, for example, in a table with five indexes, a single table insertion might require six dictionary insertions.

When inserting data, the system passes the primary row to the loader. The loader constructs the various dictionary rows from the primary row, sorts the dictionary rows, and builds the dictionaries.

One way to understand how the loader fits into a SQL database is as a data pipeline, illustrated in FIG. 56. The pipeline illustrated is creating one primary dictionary and two secondary dictionaries. Data starts in a data source (5601), and is fed to a buffer (5602) one or more rows at a time. In parallel there are three extractors: the primary extractor (5603) extracts a primary key-value pair for each row; the index A extractor (5604) extracts a key-value pair for a first index, called index A, for each row; and the index B extractor (5605) extracts a key-value pair for a second index, called index B, for each row. For each dictionary, after data has been extracted by its respective extractor, the data enters a sorter (5606), which sorts the data into the order specified by the dictionary. After being sorted, the data enters a blocker (5607), which forms the data into nodes of a dictionary. Each node is then passed to a comp/write (5608) module, which compresses the data and writes it out. All of the comp/write (5608) modules can run in parallel.

The pipeline illustrated here specifies the abstract parallelism of the computation, and the system employs a scheduler to schedule the various modules onto particular processors. For example, in a system that has four processors, the scheduler might choose to actually run only four modules in parallel at a given time. There is also parallelism within the modules; for example, within the sorter (5606) the system employs a parallel merge sort. In the example above with four processors, the scheduler might allocate two processors to the sorter (5606) and run two other modules in parallel. If a module completes, the scheduler might take the processor that was working on that module and apply it to a previously unscheduled module, or it might apply the processor to a module that can employ more parallelism, including but not limited to the sorter (5606). The scheduler dynamically allocates processors to work on the modules, adapting the number of processors allocated to a particular module even as the module is running.
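A compressed sketch of one pipeline lane follows: extract, sort, block, then compress and write in parallel. The thread pool stands in for the scheduler, and zlib stands in for whatever block compressor the system uses; all names are illustrative.

    import zlib
    from concurrent.futures import ThreadPoolExecutor

    def build_dictionary(rows, extract, sort_key, block_size, write_block):
        pairs = sorted((extract(r) for r in rows), key=sort_key)  # extractor + sorter
        blocks = [pairs[i:i + block_size]                         # blocker
                  for i in range(0, len(pairs), block_size)]
        with ThreadPoolExecutor() as pool:                        # comp/write modules
            for blob in pool.map(lambda b: zlib.compress(repr(b).encode()), blocks):
                write_block(blob)

    rows = [(3, "c"), (1, "a"), (2, "b")]
    build_dictionary(rows,
                     extract=lambda r: (r[0], r[1]),   # e.g., the primary extractor
                     sort_key=lambda kv: kv[0],
                     block_size=2,
                     write_block=lambda blob: print(len(blob), "bytes written"))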

Having described the preferred embodiment as well as other embodiments of the invention, it will now become apparent to those of ordinary skill in the art that other embodiments incorporating these concepts may be used.

1-13. (canceled)
 14. A computer program product, comprising a computerusable medium having a computer readable program code embodied therein,said computer readable program code adapted to be executed to implementa method for storing data, said method comprising: providing a system,wherein the system comprises distinct software modules, and wherein thedistinct software modules comprise a front-end module, a buffer-poolmodule, a locking module, and a file-maintenance module; one or moredata files being maintained by the file-maintenance module; where a datafile comprises a header, and one or more blocks; where zero or more ofthe blocks comprise bytes encoding leaf nodes, and zero or more of theblocks comprise bytes encoding nonleaf nodes; where a header comprises aroot block number identifying a root block; where the block identifiedby the root block number comprises an encoding of a leaf node or anonleaf node; where a nonleaf node comprises (a) a counter c, (b) one ormore child block numbers, where the number of child block numbers equalsc, each child block number identifying a block comprising a leaf node ornonleaf node, the children being numbered consecutively from 0 to c−1,(c) zero or more pivots, where the number of pivots equals one less thanthe number of child block numbers, the pivots being numberedconsecutively from 0 to c−2, and (d) one or more message queues, wherethe number of message queues equals c, the message queues being numberedconsecutively from 0 to c−1, and (e) a descriptor version number; wherea leaf node comprises (a) a sorted array of zero or more leaf entries,and (b) a checksum, and (c) a descriptor version number; the leaf andnonleaf nodes providing a tree structure, in which the leaves of thetree comprise the leaf nodes and the internal nodes of the tree comprisethe nonleaf nodes, where the root block number identifies the root ofthe tree, and for each nonleaf node, the child block numbers of the nodeidentify the respective children of the node; where a pivot comprises akey; where a key comprises a length and a sequence of zero or morebytes; where a value comprises a length and a sequence of zero or morebytes; where a key-value pair comprises a key and a value; where amessage queue comprises a FIFO of zero or more messages; where the FIFOof messages comprises a counter f and a sequence of zero or messageswhere the number of messages in the sequence equals f; where the FIFO ofmessages supports operations comprising enqueueing a message anddequeueing the least recently enqueued message; where a messagecomprises (a) a message operation, (b) a key, and (c) a value; where aleaf entry comprises (a) a value, and (b) a key; where a messageoperation comprises an insert operation, a delete operation, an abortoperation, and a commit operation; where, when the front-end moduleexecutes a insert of a key-value pair into a file in a transaction, thefront-end module causes the locking module to acquire a exclusive lockon the key, and the front-end module creates a message in which themessage operation comprises an insert operation, the key length and keycomprise the key of the key-value pair, the value length and valuecomprise the value of the key-value pair, the front-end module thencausing the file-maintenance module to insert the message into the rootblock identified in the header of the file; inserting a message intononleaf node comprising (a) identifying which children the key of themessage is to be delivered to, (b) inserting the message into the FIFOsof the message queues respectively 
corresponding to the identifiedchildren, (c) determining if the total size of the node is larger than aspecified value, and if so then flushing message queues of one or morechildren; identifying which child a message to is be delivered to in anonleaf node comprising comparing the pivot keys of the nonleaf node tothe key of the message, thus identifying one or more children that canhold the message; flushing message queues of one or more childrencomprising selecting a nonempty message queue, fetching thecorresponding child block into main memory using the buffer-pool module,and removing one or more messages from the message queue and insertingsaid messages into the child block; inserting a message into a leafnode, the insertion comprising (a) fetching from the leaf node the leafentry with the same key as the message, if such leaf entry exists,otherwise constructing an empty leaf entry, (b) if the message operationcomprises an insertion operation then i. constructing a leaf entry witha key taken from the message, a provisional value taken from the valueof the message, and a committed value taken from the previously existingleaf entry with the same key if such leaf entry exists, otherwisesetting the new leaf entry to have no committed value, ii. deleting theprevious leaf entry if it exists, and iii. storing the new leaf entry inthe leaf node, (c) if the message operation comprises a delete operationthen i. constructing a leaf entry that is a provisional delete, with thecommitted value taken from the previously existing leaf entry with thesame key if such leaf entry exists, ii. deleting the previous leaf entryif it exists, and iii. storing the new leaf entry in the leaf node,where when a leaf node is modified, it becomes smaller than a specifiedsize, and if there is a left or right sibling to the leaf, then the leafand the sibling are merged into one leaf node, and the parent of the twoleaf nodes is modified to remove the pivot that separates the two nodesand to point at the one remaining node instead of the two original leafnodes; where when a leaf node is modified, if it becomes larger than aspecified size, the leaf node is split by the file-maintenance moduleinto two leaf nodes, and the parent of the leaf node, if it exists, ismodified to include a new pivot key that separates the two leaf nodes,and to include pointers to the two leaf nodes, and if there is no parentof the leaf node, a new root node is created with the new pivot keys andthe pointers to the leaf nodes; where when a nonleaf node is modified toinclude an additional pivot and pointer, if the fanout of the nonleafnode therefore becomes larger than a specified value, then the nonleafnode is split into two nonleaf nodes, and the parent of the nonleafnode, if it exists, is modified to include a new pivot key thatseparates the two nonleaf nodes, and to include pointers to the twononleaf nodes, and if there is no parent of the nonleaf node, a new rootnode is created with the new pivot keys and the pointers to the leafnodes; where when a nonleaf node is modified to remove a pivot andpointer, if the fanout of the nonleaf node therefore becomes smallerthan a specified value, and if there is a sibling to the nonleaf node,the nonleaf node and its siblings are merged, and their parent isupdated to remove the pivot that separates the two nodes and to point atthe one remaining node instead of the two original leaf nodes; wherewhen the front-end module executes a search on a particular key in afile in a transaction, the front-end module 
causes the locking module toacquire a shared lock on the key and causes the file-maintenance moduleto traverse the tree, searching the root node for the key; wheresearching a nonleaf node for a key comprises identifying thesmallest-numbered child that can hold the key as the next node on thepath, flushing the message queue of that child, and then searching thechild node; where searching a leaf node comprises fetching from the leafnode the leaf entry with the same key as the message, if such keyexists, and returning the associated value from the leaf entry; where achecksum on a sequence of bytes is calculated by organizing the bytesinto 8-byte values, interpreting the 8-byte values as integers, andsumming the product of the ith 8-byte value with seventeen raised to thepower of i to produce a checksum; where a data file header furthercomprises a descriptor; where a descriptor comprises a version number, alength, and a sequence of zero or more bytes, the number of bytes beingequal to the length; where when the front-end module creates or opens afile, the front-end module provides a comparison function that, giventwo keys and the descriptor from the file header, compares the two keysto determine whether the two keys are to be considered equal or elsewhich one is to be ordered ahead of the other, thus ordering key-valuepairs; where the file-maintenance module maintains the tree of each fileso that in a nonleaf node, for each given a pivot key numbered i thenode, the pivot key is ordered to be greater than or equal to any of thekey-value pairs stored in child subtrees or message queues numbered lessthan or equal to i, and the pivot key is ordered to be less than orequal to any of the key-value pairs stored in child subtrees or messagequeues numbered greater than i, and the pivot key numbered i is orderedto be before any pivot keys numbered greater than i; where when thefront-end module creates a file, the front-end module specifies adescriptor causing the file-maintenance module to maintain thedescriptor version number in each node in the file. 15-23. (canceled)24. 
A computer system comprising a processor; a main memory; a secondarymemory; where one or both of the memories contains an encodedapplication for storing data on the secondary memory, that when theapplication is performed on the processor, provides a process forprocessing information, the process causing the computer system toperform the operations of organizing data in a data storage system, saidprocess comprising: providing a system, wherein the system comprisesdistinct software modules, and wherein the distinct software modulescomprise a front-end module, a buffer-pool module, and afile-maintenance module; one or more data files being maintained by thefile-maintenance module; where a data file comprises a header, and oneor more blocks; where zero or more of the blocks comprise bytes encodingleaf nodes, and zero or more of the blocks comprise bytes encodingnonleaf nodes; where a header comprises a root block number identifyinga root block; where the block identified by the root block numbercomprises an encoding of a leaf node or a nonleaf node; where a nonleafnode comprises (a) a counter c, (b) one or more child block numbers,where the number of child block numbers equals c, each child blocknumber identifying a block comprising a leaf node or nonleaf node, thechildren being numbered consecutively from 0 to c−1, (c) zero or morepivots, where the number of pivots equals one less than the number ofchild block numbers, the pivots being numbered consecutively from 0 toc−2, and (d) one or more message queues, where the number of messagequeues equals c, the message queues being numbered consecutively from 0to c−1, and (e) a descriptor version number; where a leaf node comprises(a) a sorted array of zero or more leaf entries, and (b) a checksum, and(c) a descriptor version number; the leaf and nonleaf nodes providing atree structure, in which the leaves of the tree comprise the leaf nodesand the internal nodes of the tree comprise the nonleaf nodes, where theroot block number identifies the root of the tree, and for each nonleafnode, the child block numbers of the node identify the respectivechildren of the node; where a value comprises a length and a sequence ofzero or more bytes; where a pivot comprises a key; where a key comprisesa length and a sequence of zero or more bytes; where a key-value paircomprises a key and a value; where a message queue comprises a FIFO ofzero or more messages; where the FIFO of messages comprises a counter fand a sequence of zero or messages where the number of messages in thesequence equals f; where the FIFO of messages supports operationscomprising enqueueing a message and dequeueing the least recentlyenqueued message; where a message comprises (a) a message operation, (b)a key, and (c) a value; where a leaf entry comprises (a) a value, and(b) a key; where a message operation comprises an insert operation, adelete operation, an abort operation, and a commit operation; where,when the front-end module executes a insert of a key-value pair into afile in a transaction, and the front-end module creates a message inwhich the message operation comprises an insert operation, the keylength and key comprise the key of the key-value pair, the value lengthand value comprise the value of the key-value pair, the front-end modulethen causing the file-maintenance module to insert the message into theroot block identified in the header of the file; inserting a messageinto nonleaf node comprising (a) identifying which children the key ofthe message is to be delivered to, (b) 
inserting the message into the FIFOs of the message queues respectively corresponding to the identified children; identifying which child a message is to be delivered to in a nonleaf node comprising comparing the pivot keys of the nonleaf node to the key of the message, thus identifying one or more children that can hold the message; inserting a message into a leaf node, the insertion comprising (a) fetching from the leaf node the leaf entry with the same key as the message, if such leaf entry exists, otherwise constructing an empty leaf entry, (b) if the message operation comprises an insert operation then i. constructing a leaf entry with a key taken from the message, a provisional value taken from the value of the message, and a committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, otherwise setting the new leaf entry to have no committed value, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node, (c) if the message operation comprises a delete operation then i. constructing a leaf entry that is a provisional delete, with the committed value taken from the previously existing leaf entry with the same key if such leaf entry exists, ii. deleting the previous leaf entry if it exists, and iii. storing the new leaf entry in the leaf node; where when a nonleaf node is modified to include an additional pivot and pointer, if the fanout of the nonleaf node therefore becomes larger than a specified value, then the nonleaf node is split into two nonleaf nodes, and the parent of the nonleaf node, if it exists, is modified to include a new pivot key that separates the two nonleaf nodes, and to include pointers to the two nonleaf nodes, and if there is no parent of the nonleaf node, a new root node is created with the new pivot key and the pointers to the two nonleaf nodes; where when the front-end module executes a search on a particular key in a file in a transaction, the front-end module causes the file-maintenance module to traverse the tree, searching the root node for the key; where searching a nonleaf node for a key comprises identifying the smallest-numbered child that can hold the key as the next node on the path, and then searching the child node; where searching a leaf node comprises fetching from the leaf node the leaf entry with the same key, if such a leaf entry exists, and returning the associated value from the leaf entry; where a checksum on a sequence of bytes is calculated by organizing the bytes into 8-byte values, interpreting the 8-byte values as integers, and summing the product of the ith 8-byte value with seventeen raised to the power of i to produce a checksum; where a data file header further comprises a descriptor; where a descriptor comprises a version number, a length, and a sequence of zero or more bytes, the number of bytes being equal to the length; where when the front-end module creates a file, the front-end module specifies a descriptor causing the file-maintenance module to maintain the descriptor version number in each node in the file.

25-128. (canceled)
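By way of illustration only, the message layout, the routing of a message to the smallest-numbered child that can hold its key, and the application of insert and delete messages at a leaf (with provisional and committed values) may be sketched as follows; all names are hypothetical, and taking the committed value from the prior entry's committed field is an assumption:

    from collections import deque
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Message:
        op: str                       # "insert" or "delete"; abort/commit omitted
        key: bytes
        value: bytes = b""

    @dataclass
    class LeafEntry:
        key: bytes
        provisional: Optional[bytes]  # value pending in the open transaction
        committed: Optional[bytes]    # previously committed value, if any

    @dataclass
    class NonleafNode:
        children: List[object]        # c child nodes (child blocks in the claim)
        pivots: List[bytes]           # c-1 pivot keys
        queues: List[deque] = field(default_factory=list)  # c FIFOs

        def __post_init__(self):
            if not self.queues:
                self.queues = [deque() for _ in self.children]

        def enqueue(self, msg: Message) -> None:
            # Deliver to the smallest-numbered child that can hold the key:
            # the first i with msg.key <= pivots[i], else the last child.
            i = 0
            while i < len(self.pivots) and msg.key > self.pivots[i]:
                i += 1
            self.queues[i].append(msg)

    def apply_to_leaf(leaf: dict, msg: Message) -> None:
        prior = leaf.get(msg.key)
        committed = prior.committed if prior else None  # assumption: carry forward
        if msg.op == "insert":
            # New entry: provisional value from the message, prior value kept
            # as the committed value until the transaction resolves.
            leaf[msg.key] = LeafEntry(msg.key, msg.value, committed)
        else:
            # Provisional delete: no provisional value, committed value kept.
            leaf[msg.key] = LeafEntry(msg.key, None, committed)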
129. A system comprising a structure comprising nodes, nodes comprising leaf nodes and nonleaf nodes, where a leaf node comprises key-value pairs, and a non-leaf node comprises two or more pointers to child nodes, where each child node is the root of a child subtree, where the keys stored in each child subtree are ordered to be before the keys stored in any subsequent child subtree, where c equals the number of child nodes; a non-leaf node further comprising pivot keys, where the number of pivot keys equals c−1, where the ith pivot key is greater than or equal to the keys stored in the ith child, and less than or equal to the keys stored in the i+1st child; a non-leaf node further comprising pivot keys from its child nodes and optionally further comprising pointers to some grandchild nodes; non-leaf nodes further comprising message queues, where the number of message queues equals c; each message queue comprising a collection of messages; each message comprising an insert message or a delete message; an insert message comprising a key-value pair; a delete message comprising a key; the keys in the ith queue being ordered between the ith pivot key and the i+1st pivot key, or if i is the first queue, the keys in the ith queue being ordered before the first pivot key, or if i is the last queue, the keys in the ith queue being ordered after the last pivot key; the system transforming the structure by moving messages from a nonleaf node into the root node of the corresponding child subtree; moving a message into a nonleaf node comprising determining which pair of pivot keys in the node the key of the message is between, and inserting the message into the corresponding message queue; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message if the message comprises a delete message; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message, and inserting the key-value pair of the message into the leaf node if the message comprises an insert message; the structure further comprising multiple root nodes; some nonleaf nodes being included as child nodes of more than one nonleaf node; the system searching the structure for a key by searching a root node; searching a leaf node for a key comprising returning a key-value pair that matches the key, if said pair exists, otherwise returning no key; searching a nonleaf node for a key comprising (a) identifying a grandchild node by comparing the key to the pivot keys of the child nodes, and (b) if the pointer to the grandchild node is present, then searching the grandchild node for the key and returning the result of the search, otherwise (c) identifying a child number by comparing the key to the pivot keys to determine a number i such that the key is ordered between the ith and the i+1st pivot key, or that the first pivot key is ordered after the key, or that the last pivot key is ordered before the key, and (d) searching the ith child node for the key.
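By way of illustration only, the search of claim 129, in which a nonleaf node caches pivot keys from its children and may hold pointers to some grandchildren, may be sketched as follows; the accessors is_leaf, entries, grandchild, child_index, and children are hypothetical, not claim language:

    def search(node, key):
        if node.is_leaf:
            return node.entries.get(key)   # the matching pair, or None ("no key")
        g = node.grandchild(key)           # compare key to cached child pivot keys
        if g is not None:                  # grandchild pointer present: skip a level
            return search(g, key)
        i = node.child_index(key)          # i from comparing key to this node's pivots
        return search(node.children[i], key)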
130. A system comprising a structure comprising nodes, nodes comprising leaf nodes and nonleaf nodes, where a leaf node comprises key-value pairs, and a non-leaf node comprises two or more pointers to child nodes, where each child node is the root of a child subtree, where the keys stored in each child subtree are ordered to be before the keys stored in any subsequent child subtree, where c equals the number of child nodes; a non-leaf node further comprising pivot keys, where the number of pivot keys equals c−1, where the ith pivot key is greater than or equal to the keys stored in the ith child, and less than or equal to the keys stored in the i+1st child; non-leaf nodes further comprising message queues, where the number of message queues equals c; each message queue comprising a collection of messages; each message comprising an insert message or a delete message; an insert message comprising a key-value pair; a delete message comprising a key; the keys in the ith queue being ordered between the ith pivot key and the i+1st pivot key, or if i is the first queue, the keys in the ith queue being ordered before the first pivot key, or if i is the last queue, the keys in the ith queue being ordered after the last pivot key; the system transforming the structure by moving messages from a nonleaf node into the root node of the corresponding child subtree; moving a message into a nonleaf node comprising determining which pair of pivot keys in the node the key of the message is between, and inserting the message into the corresponding message queue; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message if the message comprises a delete message; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message, and inserting the key-value pair of the message into the leaf node if the message comprises an insert message; the structure further comprising multiple root nodes; some nonleaf nodes being included as child nodes of more than one nonleaf node; the system searching the structure for a key by searching a root node; searching a leaf node for a key comprising returning a key-value pair that matches the key, if said pair exists, otherwise returning no key; searching a nonleaf node for a key comprising (a) identifying a child number by comparing the key to the pivot keys to determine a number i such that the key is ordered between the ith and the i+1st pivot key, or that the first pivot key is ordered after the key, or that the last pivot key is ordered before the key, (b) searching the ith message queue for messages matching the key, where if a message matching the key is a delete message, then returning no key, and if the message matching the key is an insert message then returning the pair from the message, and if no matching message is found then (c) searching the ith child node for the key.
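By way of illustration only, the search of claim 130, which consults the ith message queue before descending into the ith child, may be sketched as follows; the accessors are hypothetical, and treating the most recently enqueued matching message as governing is an assumption:

    def search_with_queues(node, key):
        if node.is_leaf:
            return node.entries.get(key)   # the matching pair, or None ("no key")
        i = node.child_index(key)          # compare key to the pivot keys
        # Consult the ith queue first; assumption: the most recently
        # enqueued matching message governs.
        for msg in reversed(node.queues[i]):
            if msg.key == key:
                return None if msg.op == "delete" else (msg.key, msg.value)
        return search_with_queues(node.children[i], key)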
131. A system comprising a structure comprising nodes, nodes comprising leaf nodes and nonleaf nodes, where a leaf node comprises key-value pairs, and a non-leaf node comprises two or more pointers to child nodes, where each child node is the root of a child subtree, where the keys stored in each child subtree are ordered to be before the keys stored in any subsequent child subtree, where c equals the number of child nodes; a non-leaf node further comprising pivot keys, where the number of pivot keys equals c−1, where the ith pivot key is greater than or equal to the keys stored in the ith child, and less than or equal to the keys stored in the i+1st child; non-leaf nodes further comprising message queues, where the number of message queues equals c; each message queue comprising a collection of messages; each message comprising an insert message or a delete message; an insert message comprising a key-value pair; a delete message comprising a key; the keys in the ith queue being ordered between the ith pivot key and the i+1st pivot key, or if i is the first queue, the keys in the ith queue being ordered before the first pivot key, or if i is the last queue, the keys in the ith queue being ordered after the last pivot key; the system transforming the structure by moving messages from a nonleaf node into the root node of the corresponding child subtree; moving a message into a nonleaf node comprising determining which pair of pivot keys in the node the key of the message is between, and inserting the message into the corresponding message queue; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message if the message comprises a delete message; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message, and inserting the key-value pair of the message into the leaf node if the message comprises an insert message; the structure further comprising multiple root nodes; some nonleaf nodes being included as child nodes of more than one nonleaf node; the system searching the structure for a key by searching a root node; searching a leaf node for a key comprising returning a key-value pair that matches the key, if said pair exists, otherwise returning no key; searching a nonleaf node for a key comprising (a) identifying a child number by comparing the key to the pivot keys to determine a number i such that the key is ordered between the ith and the i+1st pivot key, or that the first pivot key is ordered after the key, or that the last pivot key is ordered before the key, and (b) searching the ith child node for the key.
132. A system comprising a structure comprising nodes, nodes comprising leaf nodes and nonleaf nodes, where a leaf node comprises key-value pairs, and a non-leaf node comprises two or more pointers to child nodes, where each child node is the root of a child subtree, where the keys stored in each child subtree are ordered to be before the keys stored in any subsequent child subtree, where c equals the number of child nodes; a non-leaf node further comprising pivot keys, where the number of pivot keys equals c−1, where the ith pivot key is greater than or equal to the keys stored in the ith child, and less than or equal to the keys stored in the i+1st child; a non-leaf node further comprising pivot keys from its child nodes and optionally further comprising pointers to some grandchild nodes; non-leaf nodes further comprising message queues, where the number of message queues equals c; each message queue comprising a collection of messages; each message comprising an insert message or a delete message; an insert message comprising a key-value pair; a delete message comprising a key; the keys in the ith queue being ordered between the ith pivot key and the i+1st pivot key, or if i is the first queue, the keys in the ith queue being ordered before the first pivot key, or if i is the last queue, the keys in the ith queue being ordered after the last pivot key; the system transforming the structure by moving messages from a nonleaf node into the root node of the corresponding child subtree; moving a message into a nonleaf node comprising determining which pair of pivot keys in the node the key of the message is between, and inserting the message into the corresponding message queue; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message if the message comprises a delete message; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message, and inserting the key-value pair of the message into the leaf node if the message comprises an insert message; the structure further comprising a root node; the system searching the structure for a key by searching a root node; searching a leaf node for a key comprising returning a key-value pair that matches the key, if said pair exists, otherwise returning no key; searching a nonleaf node for a key comprising (a) identifying a grandchild node by comparing the key to the pivot keys of the child nodes, and (b) if the pointer to the grandchild node is present, then searching the grandchild node for the key and returning the result of the search, otherwise (c) identifying a child number by comparing the key to the pivot keys to determine a number i such that the key is ordered between the ith and the i+1st pivot key, or that the first pivot key is ordered after the key, or that the last pivot key is ordered before the key, and (d) searching the ith child node for the key.
133. A system comprising a structure comprising nodes, nodes comprising leaf nodes and nonleaf nodes, where a leaf node comprises key-value pairs, and a non-leaf node comprises two or more pointers to child nodes, where each child node is the root of a child subtree, where the keys stored in each child subtree are ordered to be before the keys stored in any subsequent child subtree, where c equals the number of child nodes; a non-leaf node further comprising pivot keys, where the number of pivot keys equals c−1, where the ith pivot key is greater than or equal to the keys stored in the ith child, and less than or equal to the keys stored in the i+1st child; non-leaf nodes further comprising message queues, where the number of message queues equals c; each message queue comprising a collection of messages; each message comprising an insert message or a delete message; an insert message comprising a key-value pair; a delete message comprising a key; the keys in the ith queue being ordered between the ith pivot key and the i+1st pivot key, or if i is the first queue, the keys in the ith queue being ordered before the first pivot key, or if i is the last queue, the keys in the ith queue being ordered after the last pivot key; the system transforming the structure by moving messages from a nonleaf node into the root node of the corresponding child subtree; moving a message into a nonleaf node comprising determining which pair of pivot keys in the node the key of the message is between, and inserting the message into the corresponding message queue; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message if the message comprises a delete message; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message, and inserting the key-value pair of the message into the leaf node if the message comprises an insert message; the structure further comprising a root node; the system searching the structure for a key by searching a root node; searching a leaf node for a key comprising returning a key-value pair that matches the key, if said pair exists, otherwise returning no key; searching a nonleaf node for a key comprising (a) identifying a child number by comparing the key to the pivot keys to determine a number i such that the key is ordered between the ith and the i+1st pivot key, or that the first pivot key is ordered after the key, or that the last pivot key is ordered before the key, (b) searching the ith message queue for messages matching the key, where if a message matching the key is a delete message, then returning no key, and if the message matching the key is an insert message then returning the pair from the message, and if no matching message is found then (c) searching the ith child node for the key.
134. A system comprising a structure comprising nodes, nodes comprising leaf nodes and nonleaf nodes, where a leaf node comprises key-value pairs, and a non-leaf node comprises two or more pointers to child nodes, where each child node is the root of a child subtree, where the keys stored in each child subtree are ordered to be before the keys stored in any subsequent child subtree, where c equals the number of child nodes; a non-leaf node further comprising pivot keys, where the number of pivot keys equals c−1, where the ith pivot key is greater than or equal to the keys stored in the ith child, and less than or equal to the keys stored in the i+1st child; non-leaf nodes further comprising message queues, where the number of message queues equals c; each message queue comprising a collection of messages; each message comprising an insert message or a delete message; an insert message comprising a key-value pair; a delete message comprising a key; the keys in the ith queue being ordered between the ith pivot key and the i+1st pivot key, or if i is the first queue, the keys in the ith queue being ordered before the first pivot key, or if i is the last queue, the keys in the ith queue being ordered after the last pivot key; the system transforming the structure by moving messages from a nonleaf node into the root node of the corresponding child subtree; moving a message into a nonleaf node comprising determining which pair of pivot keys in the node the key of the message is between, and inserting the message into the corresponding message queue; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message if the message comprises a delete message; moving a message into a leaf node comprising deleting from the leaf node any pair matching the key of the message, and inserting the key-value pair of the message into the leaf node if the message comprises an insert message; the structure further comprising a root node; the system searching the structure for a key by searching a root node; searching a leaf node for a key comprising returning a key-value pair that matches the key, if said pair exists, otherwise returning no key; searching a nonleaf node for a key comprising (a) identifying a child number by comparing the key to the pivot keys to determine a number i such that the key is ordered between the ith and the i+1st pivot key, or that the first pivot key is ordered after the key, or that the last pivot key is ordered before the key, and (b) searching the ith child node for the key.
135. A system comprising a loader, the loader comprising distinct software modules, where the distinct software modules comprise a data source module, a buffer module, one or more primary extractor modules, one or more sorter modules, one or more blocker modules, and one or more compressor/writer modules, where the data source module is connected to the buffer module, the buffer module is connected to the primary extractor modules, each extractor module is connected to a sorter module, each sorter module is connected to a blocker module, and each blocker module is connected to one or more compressor/writer modules; where the data source module accepts data comprising rows from an external source and passes the data to the buffer module, where the buffer module buffers the data and passes a copy of the data to each extractor module, where an extractor module transforms each row into a new row and passes the new row to a sorter module, where a sorter module sorts all of the rows it receives and passes the sorted rows to a blocker module, where a blocker module organizes the rows into blocks, the blocks comprising a tree-structured dictionary, and passes the blocks to a compressor/writer module, where a compressor/writer module compresses each block it receives and writes the compressed block to a file.
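By way of illustration only, the loader of claim 135 may be sketched sequentially as follows, with the module boundaries collapsed into one function; the identity transform, the sort key, the rows-per-block threshold, and the JSON/zlib encodings are all assumptions:

    import json
    import zlib

    def load(rows, path, rows_per_block=1024):
        buffered = list(rows)                          # buffer module
        transformed = [dict(row) for row in buffered]  # extractor (identity here)
        # Sorter module; sorting by a canonical JSON rendering is an assumption.
        transformed.sort(key=lambda row: json.dumps(row, sort_keys=True))
        # Blocker module: group the sorted rows into fixed-size blocks.
        blocks = [transformed[i:i + rows_per_block]
                  for i in range(0, len(transformed), rows_per_block)]
        # Compressor/writer module: compress each block and write it out.
        with open(path, "wb") as f:
            for block in blocks:
                f.write(zlib.compress(json.dumps(block).encode("utf-8")))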
136. The system of claim 135 further comprising a scheduler that schedules the various modules to run in parallel while honoring the dependencies between the modules.
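By way of illustration only, such a scheduler may be sketched as one thread per module with bounded queues carrying items between adjacent modules, so that the dependency order is honored by dataflow while the modules run in parallel; run_pipeline, the queue sizes, and the None sentinel protocol are assumptions:

    import queue
    import threading

    def run_pipeline(stages, source):
        # One queue between each pair of adjacent modules, plus input/output.
        qs = [queue.Queue(maxsize=64) for _ in range(len(stages) + 1)]

        def worker(fn, qin, qout):
            while (item := qin.get()) is not None:
                qout.put(fn(item))
            qout.put(None)                 # propagate the shutdown sentinel

        threads = [threading.Thread(target=worker, args=(fn, qs[i], qs[i + 1]))
                   for i, fn in enumerate(stages)]

        def feed():
            for item in source:
                qs[0].put(item)
            qs[0].put(None)

        threads.append(threading.Thread(target=feed))
        for t in threads:
            t.start()
        out = []
        while (item := qs[-1].get()) is not None:  # drain the final stage
            out.append(item)
        for t in threads:
            t.join()
        return out

A per-row pipeline of this shape suits the extractor and blocker stages; a sorter, which must receive all of its rows before emitting any, would buffer internally before forwarding.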